Renaud: As you know, Datadog log management launched in March 2018, and we’re happy to see how customers have adopted this solution.
We are now ingesting hundreds of terabytes of logs every day for thousands of companies all around the world.
It provides a seamless experience for DevOps engineers, with hundreds of Datadog integrations now collecting metrics and logs that share the same tags.
So when you see the interface for the first time, everything is pre-canned and easy to use, and you can seamlessly pivot between logs, metrics, and traces.
Last year at Dash, we announced a very appreciated addition to the product (I think you know where I’m going): Logging without Limits™, which decouples what you ingest from what you index. This unlocked many capabilities: you can now send all your logs affordably, with no server-side filtering, and observe them in real time through the Live Tail.
Importantly, you can dynamically control what is indexed through filters, depending on the situation.
And you can sleep peacefully, because all these logs are archived into the cloud storage of your choice.
Monitoring the backend is critical for ensuring that your users’ requests are properly served.
And you’re already well covered with Datadog.
However, some questions still arise like: “Are my users truly happy with their experience? Do they get some errors?”
The fact is that you don’t know because you are not necessarily looking at it.
You can also use it as a logger, where you can enrich all JSON entries with attributes and context; URLs, client IPs, and user agents are automatically collected, parsed, and named the right way.
So correlating your frontend to your backend is made simple.
I’m going to continue the story with these new browser logs.
Let’s say you have a huge number of users, who generate a large volume of logs.
Unfortunately, indexing them all would be cost-prohibitive.
What is nice here is that I can keep just 5% of them by defining an exclusion filter on my index.
And the team is very happy because they can still see everything that is going on, and you don’t lose a single log, because they all still go into the AWS S3 archive.
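To make the sampling concrete, here is a minimal sketch of such an exclusion filter, loosely modeled on Datadog’s logs index configuration. The field names and the query are illustrative assumptions, not the exact API schema.

```python
# Sketch of an index exclusion filter that keeps only 5% of browser logs.
# Field names are illustrative, loosely following Datadog's index config.

def make_exclusion_filter(query: str, keep_fraction: float) -> dict:
    """Build an exclusion filter that drops (1 - keep_fraction) of matching logs."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    return {
        "name": f"sample-{query}",
        "is_enabled": True,
        "filter": {
            "query": query,
            # sample_rate is the fraction of matching logs to EXCLUDE,
            # so keeping 5% means excluding 95%.
            "sample_rate": round(1.0 - keep_fraction, 4),
        },
    }

flt = make_exclusion_filter("source:browser", keep_fraction=0.05)
print(flt["filter"]["sample_rate"])  # 0.95
```

The key point is that the filter applies only to indexing; logs excluded this way still flow to Live Tail and to the archive.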
Generate Metrics from Logs
Now, what if the team wants to keep track of the actual count of errors by browser family?
And this is going to be my second surprise this year.
Yes, Datadog is now able to generate metrics from logs.
And let me highlight this very important point.
It works at the ingest level, so you can summarize, very affordably, large quantities of logs into very efficient custom metrics.
Here’s how that works.
In the Logs menu, you’ll find a new “generate metrics” step.
Then step one, you tell us what you want to summarize.
In that case, I’m going to keep my browser error logs.
Then tell us the contents of your summary.
It’s going to be a simple count with three attributes: the error kind, the browser family, and I’m also going to keep the domain name.
I name it, and Datadog now maintains this log-based metric and will retain it for 15 months.
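As a sketch, the metric definition produced by those steps might look like the following. This is modeled on Datadog’s logs-to-metrics configuration, but the exact field names and attribute paths here are illustrative assumptions.

```python
# Illustrative sketch of a log-based metric: a count of browser error logs,
# grouped by error kind, browser family, and domain name. Field names and
# attribute paths are assumptions, not the exact Datadog schema.
log_metric = {
    "id": "browser.errors.count",  # the metric name you would query later
    "filter": {"query": "source:browser status:error"},  # step 1: what to summarize
    "compute": {"aggregation_type": "count"},            # step 2: a simple count
    "group_by": [                                        # the three attributes kept
        {"path": "@error.kind"},
        {"path": "@http.useragent_details.browser.family"},
        {"path": "@http.url_details.host"},
    ],
}
print(len(log_metric["group_by"]))  # 3
```

Because the aggregation happens at ingest, only the resulting time series is stored, not the underlying log events.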
You’ll also be able to use all the machine learning, forecasting, anomaly-detection algorithms, among others, that are available with Datadog metrics.
This is an extremely valuable addition to Logging without Limits™.
While these were exciting announcements, we saved the best for last.
A user comes to me and says that he experienced a failure, but three months ago.
The problem is that I don’t see the logs in Datadog anymore.
So what should I do?
Should I look into the archives?
We all know that trying to find logs in cold storage is slow and difficult.
So this is why, today, we came up with a solution we are particularly proud of: Log Rehydration™.
Take this specific example: I’m looking at the browser logs of my user, John Doe, on a specific day in May, and my Log Explorer remains empty.
However, just by clicking here, with no new query syntax to learn, we prepare the rehydration job for you: we find the right files, scan them to match your query, and finally reload only the log events you need to understand what happened.
And in most cases, in less than a minute, I can start troubleshooting as if these logs had always been there.
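For illustration, a rehydration job boils down to the same search query you would type in the Log Explorer plus a bounded time range, so only matching events are scanned and reloaded. The field names and the example date below are hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical sketch of a rehydration job's scope. The query and the
# dates are made up for illustration; they do not come from the demo.
rehydration_job = {
    "query": 'source:browser @usr.name:"John Doe"',
    "from": datetime(2019, 5, 14, tzinfo=timezone.utc).isoformat(),
    "to": datetime(2019, 5, 15, tzinfo=timezone.utc).isoformat(),
}
print(rehydration_job["from"] < rehydration_job["to"])  # True
```

Narrowing both the query and the time window is what keeps the job fast: the fewer archive files match, the less there is to scan and reload.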
As for use cases, you’ll be able to run technical audits, like the troubleshooting example here; business audits, like, “Should I really refund my customer? I’m going to prove that by looking at my logs”; or security audits.
We believe this is a game-changer, and we are already seeing a lot of excitement around rehydration.
But as each business, each application, each team is unique, we wanted to give you another point of view from a customer today.
They’ll tell us about how they use Logging without Limits™ to ensure the quality of service for their 28 million subscribers.
Let me call on stage Kartik Garg, director of cloud and platform engineering at Hulu.
How Hulu used Logging without Limits™
Kartik: Thank you, Renaud.
Really, really glad to be here.
I wanted to give you a little bit of an introduction about how I came to know about Datadog Logging.
And the story starts about a year ago, with a colleague of mine at my previous company, Allan Rentiado.
He came up to me and told me, “Hey, I checked out Datadog Logging, it looks really cool. Can you check it out?”
I said, “Sure.”
Comes back to me about a week later, “Did you get a chance to check it out?”
I’m like, “No, not really.”
He’s like, “You should really check it out.”
I’m like, “Sure.”
A couple more weeks pass by and Allan comes right back to me, he’s like, “Hey, so did you finally get a chance to check out Datadog Logging?”
And I said, “No, I’m perfectly happy with our current logging solution.”
And he said, “I think you should really give them a shot.”
So here I am, about a year later talking about Datadog Logging and why it makes sense for Hulu.
But a little bit about Hulu before I jump into why Datadog made sense for Hulu, so you can get an idea of the scale at which we operate Datadog, especially for logs.
Hulu is now the fastest growing video service in the United States.
And we grew from 20 million to 28 million subscribers between 2017 and 2018.
Our total hours watched have increased by 75%.
So not only are we growing in terms of the number of people that wanna watch content on Hulu, but for a given user, we’re growing the amount of content that they actually watch on Hulu.
We have over 100 teams globally across three locations.
I come from Santa Monica, so pardon my jet lag, if you see any.
We have a team in Seattle and a team that’s in Beijing as well.
And we started out at Hulu about 10 years ago with microservices.
So we have over 1000 services and an unofficial count reveals that we’ve had team members consume over a million snacks.
But we have one purpose and that is you, our viewer.
So let’s take a look at why Datadog. Obviously, besides Allan telling me it’s a good idea, one of our core values is to start with the viewer.
And that’s kind of enshrined into…the first day when you come into Hulu, the first thing they tell you is, “We start with the viewer.”
And that’s powerful because we don’t just wanna know about a viewer, or some viewers, or this viewer or that viewer; it’s “the viewer.”
We wanna know everything about each one of our viewers’ experience, which means a lot of logs.
And we can’t possibly index everything that we ingest, hence Logging without Limits™.
So let’s talk a little bit about our scale of logs and what we consume.
Here you can see our log dashboard and I’ll kind of drill into the logs that we consume.
So we’re in the business of live TV.
For those of you in media and entertainment, you’ll know that live TV prime time is between 5 p.m. and 8 p.m. Pacific, or 8 p.m. and 11 p.m. Eastern, which means logs, lots and lots and lots of logs. Our peak is close to 150 million logs per minute, not 150 logs per minute.
Sometimes our logs get away from us, and we want to be able to detect those anomalies when our logs do get away from us.
Honestly, without log-based metrics, finding those anomalies was pretty tricky.
Let’s take a look at our log counts; we see that there’s nothing out of the ordinary over here.
I mean, these are the services that we expect to have made a lot of logs.
So then we go back and take a look at the log size.
“Why, hello, Loki.” This service is emitting close to twice the log volume of the next highest service, but we never look at Loki logs, so let’s fix that.
Another one of the features that Renaud announced over here was archive rehydration, and we’re already seeing uses of that.
And one of the uses that we saw, and this is kind of stolen right from Renaud’s slide over here, is a user who got a significant error. In our case, it was a month ago.
And the responsiveness of a specific part of the Hulu browsing experience was significantly degraded, but we were not sure why.
Heimdall, which is our edge browsing experience team, looked at countless logs from our S3 archives.
So what we did, in this case, was load their logs from the archive using rehydration, for a specific duration, to figure out what was happening with respect to some of the surrounding services.
And by looking at the neighboring services, we were able to figure out that we should be looking at this user for that service for a specific time period.
So we popped that into the UI, we waited for about 30 minutes, and caught up on our favorite episode (you can do the same) of the Simpsons/Family Guy crossover on Hulu, as Datadog crunched through about 20 terabytes of logs during that time period.
And we found that one user was hitting our service, and getting a 200 response code back, about 8,300 times in a 30-minute window; that’s about five times a second.
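A quick sanity check on that rate, purely arithmetic:

```python
# 8,300 requests spread over a 30-minute window:
requests = 8300
window_seconds = 30 * 60          # 1,800 seconds
rate = requests / window_seconds  # requests per second
print(round(rate, 1))             # 4.6, i.e. roughly five per second
```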
So, ladies and gentlemen, that’s what a site-scraper looks like.
And the fact that a single Hulu user was potentially able to harm the experience for other users on Hulu brings us back to a very important point about why this is so important for us.
Our viewers (that’s you) just wanna watch TV, and TV has always just worked.
If you remember, for those of you that have kids, when you were a kid yourself, you’d be standing there with these rabbit-ear antennas, trying to hold them just right in various positions so that your parents could, obviously, watch TV.
And then it was buried cable after that: managed networks, more choices, physical wiring into your home, a dedicated payload of multimedia.
And then, about 15 years ago, it all got changed up again: DVR, on-demand.
But we’re now at another inflection point.
And this is where our viewers’ habits are changing.
Because now it’s not about live or on-demand.
For our viewers, for you, a Game of Thrones finale is almost the same as an NBA Finals game, but if you think about the complexities of those, they’re pretty different.
So what we are doing at Hulu is trying to build TV around you on top of an infrastructure that was not built for serving TV over the internet.
So previously, we had closed, fully controlled networks.
Now, we’re relying on the open Internet: lots of networks of routers and switches, cloud infrastructure, lots of players, lots of technology, basically no single player from the source to the last mile.
And that brings a lot of challenges when we’re trying to deliver reliable, low-latency video to our viewers.
And on top of that, this is complicated by the fact that there isn’t standard adoption, unlike other areas.
There’s no standardization around our markers for when an episode starts or ends, and metadata isn’t standardized either.
But we love complexity at Hulu.
And so we dare you to dream of better TV with us.