How to monitor Lambda functions

Published: September 20, 2017

As “serverless” application architectures have gained popularity, AWS Lambda has become the best-known service for running code on demand without having to manage the underlying compute instances. From an ops perspective, running code in Lambda is fundamentally different from running a traditional application. Most significantly from an observability standpoint, you cannot inspect system-level metrics from your application servers. But you can—and should—closely monitor performance and usage metrics from your Lambda functions (the individual services or bits of code you run in Lambda).

In this post, we’ll walk through some of the ways that serverless monitoring diverges from traditional application monitoring. And we’ll present a concrete example of monitoring the performance and usage of a simple Lambda function with Datadog.

What is Lambda?

AWS Lambda is an event-driven compute service in the Amazon cloud that abstracts away the underlying physical computing infrastructure, allowing developers to focus on their code rather than the execution environment. A Lambda function can be triggered by AWS events, such as an object being deleted from an S3 bucket; API calls via AWS API Gateway; or through manual invocation in the AWS user interface. Lambda adoption has flourished since its introduction in 2014, and a rich ecosystem is developing around Lambda and other serverless technologies.

Lambda is well integrated with several AWS services you may already be using, such as ELB, SES, and S3.

Serverless monitoring mindset

Monitoring applications running in AWS Lambda presents unique challenges when compared to monitoring a traditional application server. For starters, there is no long-lived host you can monitor, which means there is no place to drop a monitoring agent to collect telemetry data.

“Serverless” does not mean that there is no computer executing code, however. Rather, it means that developers do not need to provision and maintain application servers to run their code. The burden of patching, securing, and maintaining the infrastructure behind a Lambda function falls to Amazon Web Services. Deploying serverless code is as simple as uploading your application (and dependencies) to AWS and configuring some runtime constraints, like maximum memory allotted and max execution time.

Because of this abstraction, in a serverless deployment you don’t have access to all of the traditional system metrics (like disk usage and RAM consumption) that could inform you of the health of your system. But with proper instrumentation of your applications and supporting services, you can ensure that your systems are observable, even in the absence of metrics on CPU, memory, and the like.

Datadog dashboard for AWS Lambda monitoring
Datadog's built-in Lambda monitoring dashboard captures high-level metrics on function usage and performance.

Lambda performance metrics

Lambda performance metrics can be broken out into two groups: the standard metrics reported out of the box by AWS, and the custom metrics that you define and emit from your function code.

We’ll tackle each in turn. Along the way, we’ll characterize metrics as “work” or “resource” metrics—for background on this distinction, refer to our Monitoring 101 posts on metric collection and alerting.

AWS metrics

| Metric | Description | Metric Type |
|---|---|---|
| duration | Duration of a function’s execution in milliseconds | Work: Performance |
| invocations | Count of executions of a function | Work: Throughput |
| errors | Count of executions that resulted in an error | Work: Error |
| throttles | Count of throttled invocation attempts | Resource: Saturation |

These metrics are available through the AWS CloudWatch console and give you the raw data about the execution of your functions. With these metrics alone, you can estimate projected execution costs, identify trends in execution frequency, and quickly identify when errors start to pile up.
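To make the cost-estimation point concrete, here is a rough sketch of how invocation count, average duration, and allocated memory translate into a projected bill. The pricing figures are assumptions based on AWS's published rates at the time of writing ($0.20 per million requests and $0.00001667 per GB-second), and real Lambda billing rounds each invocation's duration up to the nearest 100 ms, which this sketch ignores:

```python
# Assumed rates from AWS's published Lambda pricing; check the current
# pricing page before relying on these numbers.
PRICE_PER_MILLION_REQUESTS = 0.20
PRICE_PER_GB_SECOND = 0.00001667

def estimated_cost(invocations, avg_duration_ms, memory_mb):
    # Billed compute is measured in GB-seconds: total execution time
    # multiplied by the memory allocated to the function.
    gb_seconds = invocations * (avg_duration_ms / 1000.0) * (memory_mb / 1024.0)
    request_cost = (invocations / 1e6) * PRICE_PER_MILLION_REQUESTS
    return request_cost + gb_seconds * PRICE_PER_GB_SECOND
```

For example, a function allocated 128 MB that runs a million times at 100 ms per invocation consumes 12,500 GB-seconds, for a projected cost of roughly $0.41.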

Graph of Lambda metrics in the AWS console
Standard metrics from Lambda are available via the AWS CloudWatch console.

That being said, without additional metrics your insight into application performance will be somewhat limited. You can see, for example, that your function is executing slowly, but you won’t have much additional context to help you pinpoint the source of the slowdown.

As you will see, even minimal instrumentation of Lambda functions yields significant insight into application performance.

Custom metrics from Lambda functions

Beyond the out-of-the-box metrics provided by AWS, you will likely want to track performance and usage metrics that are unique to your use case and application. For example, if your function is interacting with an external API, you would likely want to track your API calls; likewise, if your Lambda function interacts with a database to manage state, you’d want to track reads and writes to that database.

In the context of web application performance monitoring, some of the metrics that are nearly universally valuable include:

  • requests (throughput)
  • responses (including specific error types)
  • latency
  • work done to service requests

Of course, choosing what to monitor will be largely dependent on your specific use case, your business, and any SLAs you may have in place.

To capture all of the above requires instrumenting the entry and exit points of your application, as well as instrumenting the code segments where the actual work is performed.

Instrumenting application internals

Consider a simple Lambda function hooked up to API Gateway that performs two kinds of work: given a string, it returns the string’s MD5 hash to the caller; given an MD5 hash, it returns the original string.

From a high level, the application to be instrumented uses a Lambda request handler that is invoked each time a request comes in via Amazon API Gateway. The application logic is contained in four functions:

  • lambda_handler is the entry point for our application
  • read_s3 retrieves the data file from S3
  • hash_exists reads & searches the data file for a hash
  • response returns the requested string or hash, if the request is successful, along with an HTTP status code
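To give those pieces some shape, here is a hypothetical minimal sketch of the hashing direction of the handler. The event shape and error handling are assumptions, and the reverse lookup (which would call read_s3 and hash_exists against the data file) is omitted to keep the sketch self-contained:

```python
import hashlib
import json

def response(status_code, body):
    # Wrap the result in the structure API Gateway expects back
    # from a Lambda proxy integration.
    return {'statusCode': status_code, 'body': json.dumps(body)}

def lambda_handler(event, context):
    # Hashing direction only: take a string and return its MD5 digest.
    text = event.get('string')
    if text is None:
        return response(400, {'error': 'expected a "string" parameter'})
    digest = hashlib.md5(text.encode('utf-8')).hexdigest()
    return response(200, {'md5': digest})
```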

We’ve also defined a helper function, log, to emit metrics in specially formatted loglines that Datadog’s integration is designed to ingest. The format is like so:

MONITORING|unix_epoch_timestamp|metric_value|metric_type|my.metric.name|#tag1:value,tag2

where metric_type can be gauge, count, histogram, or check. A brief refresher on metric types:

  • A gauge represents an instantaneous value.
  • A count represents a long-running counter that can be incremented or decremented over time.
  • A histogram generates several aggregate metrics (avg, count, max, min, p95, median) at one-second granularity to help describe the distribution of your data points.
  • A service check sends an integer value to describe the current state of the monitored service or function (0 for OK, 1 for warning, 2 for critical, 3 for unknown).
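On the receiving side, a line in this format splits cleanly on the pipe character. The parser below is purely illustrative—it is not part of the Datadog integration—but it shows how each field of the logline maps to a metric component:

```python
def parse_monitoring_line(line):
    # Split a MONITORING logline into its named components.
    # Format: MONITORING|timestamp|value|type|name|#tag1:value,tag2
    marker, timestamp, value, metric_type, name, tags = line.split('|')
    assert marker == 'MONITORING'
    return {
        'timestamp': int(timestamp),
        'value': float(value),
        'type': metric_type,
        'name': name,
        'tags': tags.lstrip('#').split(','),
    }
```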

A valid log function that increments a metric counter by 1 might look like this:

import time

def log(metric_name, metric_type='count', metric_value=1, tags=None):
    # Emit a metric in the MONITORING logline format described above
    print("MONITORING|{}|{}|{}|{}|#{}".format(
        int(time.time()), metric_value, metric_type,
        'hasher.lambda.' + metric_name, ','.join(tags or [])
    ))

Instrumentation examples for other metric types are available in Datadog’s Lambda integration docs.

In the example above, we append the prefix hasher.lambda to every metric we send (e.g. hasher.lambda.requests or hasher.lambda.responses). Using a consistent metric prefix makes the metrics easy to find when we’re building metric graphs or alerts.

Using this log function, instrumenting your application is as easy as calling a function wherever you want to increment a metric.

Set collection points

With a log function defined and ready to be sprinkled about, you can start to think about what kinds of metrics would capture information that would help you measure and track the performance of your service. As previously mentioned, you typically will want to monitor application requests and responses, which provide a good starting point for placing instrumentation.

Counting requests

In our example application, capturing high-resolution metrics on the request rate would be very useful for understanding throughput, which is a key work metric for any application. The easiest way to capture the requests as they come in is to instrument the beginning of the request-handling code. The hasher service processes all requests via the lambda_handler() function, so capturing the count of requests is as simple as adding a call to log at the very beginning of that function.

A simple, clear name for this metric would be hasher.lambda.requests. Using our log function defined above, we can start collecting the request metric simply by adding a log line inside the lambda_handler() function:

def lambda_handler(event, context):
    log(metric_name='requests', tags=['hash-service'])

Now we have a metric tracking the request count for the Lambda function:

Graph of a request count metric in Datadog
A graph of requests served by the Lambda function over time.

Counting (and tagging) responses

The next logical place to drop calls to log() is wherever the Lambda function has a return statement (a response), since these are effectively the exit points of the Lambda function. So we can instrument our response function to increment a counter every time a response is returned to the client. These response metrics are emitted under the catchall metric name hasher.lambda.responses, tagged with the specific status code associated with the response.

def response(statusCode, body):
    # str() ensures the tag is well formed whether statusCode is an int or a string
    log(metric_name='responses', tags=['hash-service', 'status:' + str(statusCode)])

By tagging the hasher.lambda.responses metric with the associated HTTP response code (e.g. status:404), we can break down our responses in Datadog to visualize successful requests alongside the count of specific error types:

Graph of response codes in Datadog
Successful responses (200 codes) returned by the Lambda function are in green, while 404 errors are red.

Capturing latency statistics

Of course, not all metrics are simple increment-by-one counts. For example, we may want to track the duration of our hash_exists function, to make sure it doesn’t introduce unacceptable latency into our overall application. In the example script, we’ve calculated the latency of that function as function_duration. With a log line submitting that data as a histogram metric, we can capture the latency of hash_exists and track its distribution over time:

    # subtract start-of-function timestamp from the current timestamp
    function_duration = time.time() - function_start
    log(metric_name='hash_exists.latency', metric_type='histogram',
        metric_value=function_duration, tags=['hash-service'])

As mentioned above, submitting a histogram metric provides you with several aggregate metrics on application performance in Datadog, such as median, min, max, and p95 latency.
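If you find yourself timing several functions this way, the start/stop bookkeeping can be factored into a decorator. This is an embellishment on the approach above rather than part of the example script, and the hash_exists body shown is a toy stand-in for the real S3-backed lookup:

```python
import functools
import time

def log(metric_name, metric_type='count', metric_value=1, tags=None):
    # Same logline helper defined earlier in the article
    print("MONITORING|{}|{}|{}|{}|#{}".format(
        int(time.time()), metric_value, metric_type,
        'hasher.lambda.' + metric_name, ','.join(tags or [])
    ))

def timed(metric_name, tags=None):
    # Wrap a function so its wall-clock duration is emitted as a
    # histogram metric on every call, even when the call raises.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return fn(*args, **kwargs)
            finally:
                log(metric_name=metric_name, metric_type='histogram',
                    metric_value=time.time() - start, tags=tags)
        return wrapper
    return decorator

@timed('hash_exists.latency', tags=['hash-service'])
def hash_exists(md5_hash, known_hashes):
    # Toy stand-in for the real lookup against the S3 data file
    return md5_hash in known_hashes
```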

Graph of function latency in Datadog
A function's median latency is graphed in relation to the p95 latency.

Combine and correlate metrics

Even with a fully instrumented Lambda function, supplemented with the standard service metrics emitted by AWS, there can still be gaps in visibility. A complete monitoring plan would also take into account any external system dependencies your application may have, such as load balancers and data stores. Combining and correlating metrics from all of these data sources provides a much more comprehensive view to help diagnose performance issues and monitor the overall health of your system.

Our custom Lambda metrics combined with metrics from other services
A custom Lambda dashboard tracks usage and performance metrics from the Lambda function, as well as metrics from dependencies such as AWS S3 and API Gateway.

Combining relevant metrics from interconnected services into a single dashboard like the one above provides a ready-made starting point for troubleshooting the performance of a serverless application.

Alerting

Once Datadog is deployed to capture and visualize metrics from your applications, Lambda functions, and infrastructure, you will likely want to configure a set of alerts to be notified of potential issues. Datadog’s machine learning–powered alerting features, such as outlier detection and anomaly detection, can automatically alert you to unexpected behavior. Datadog integrates seamlessly with communication tools like Slack, HipChat, PagerDuty, and OpsGenie, ensuring that you can alert the right person to issues when they arise.

Full observability ahead

We’ve now walked through how you can gather meaningful metrics from your functions with just a few lines of instrumentation. We then tied it all together with metrics from the rest of your systems.

If you don’t yet have a Datadog account, you can sign up and start monitoring your Lambda functions today. If you’re already up and running with Datadog, enable the Lambda integration to start enhancing the observability of your serverless applications.

Acknowledgment

Many thanks to Datadog technical writing intern Rishabh Moudgil for his contributions to this article.

