Finding the hidden capacity in your infrastructure (Zendesk)

Published: July 12, 2018

Recovering hidden capacity

Daniel: How is everybody doing?

Everybody have a good break and a good rest?

Who’s ready to get excited about performance and capacity?

That was not the response I was expecting.

I thought everybody was going to be sleepy and sad and was going to need to get excited.

So, with that in mind, what if I told you that it might be possible to add capacity to your system without calling up your cloud provider and saying, “Spin me up some more instances?”

Who says it’s impossible?

All right,

you’re willing to bear with me.

And you might say it is impossible, and that might be right depending on how you look at the problem.

And today, we want to talk about the capacity that could be right now hidden in your infrastructure.

Who are we?

So, who are these guys?

My name is Daniel Rieder and this is Anatoly Mikhaylov,

and we are the Capacity and Performance Team at Zendesk.

We care about utilization, and not just infrastructure utilization,

as you can see, Anatoly is not quite pleased with the utilization of this waffle iron here,

it could be used a little bit better, it could make a little bit more perfect waffle. And that’s what our team has been about, making a little bit more perfect of a waffle out of your infrastructure and making sure that we’re doing the absolute best that we can.

Who is Zendesk?

And who’s Zendesk?

My PR Department tells me that Zendesk delivers leading cloud-based customer service software. Loved by customers for its simplicity and elegance,

Zendesk is one of the easiest, fastest ways to provide great customer service.

Our solution is easy to try, buy, implement, and use.

The truth is, we have this wonderful product called Zendesk Suite.

It’s helpdesk software.

It makes sure that your customers have a great way to contact you and have a great experience when they’re doing it.

Life, liberty, and the pursuit of better utilization

Now you know where we work and what we do.

And so we want to talk about life, liberty, and the pursuit of better utilization. And how we got to be where we are today,

because your average computer science program doesn’t ordinarily set up anyone to work in a career of performance and capacity.

People who tend to be attracted to this, tend to be somewhat fanatical about the problem space.

They’re interested in making sure things fit together properly,

they don’t want to waste resources when they don’t need to.

Now, I know that probably sounds like, you know, maybe some of you.

Maybe you work with people like this. Maybe people like this drive you crazy, because they’re the ones who want to put something in your backlog that addresses something related to performance

and you look at it and you think, “Yeah, that’s fine, it’s probably something that we should do. But don’t these guys understand,

we have features to deliver.

What is this performance and capacity stuff?

Why do we need to do this?”

And that’s the point.

And that’s a very, very well taken point. In every place I’ve ever worked, features always come before performance and capacity related issues,

and we get that, we really do.

Eventually though, neglecting these things might not be such a great idea.

There’s no such thing as a dragon

So let me tell you a little story.

This is a children’s book called, “There’s No Such Thing as a Dragon,” it’s by Jack Kent.

Has anybody ever read this book?

Read it to your kid?

Okay, we got a couple people in the audience.

It’s a very important book, you should all read it.

I’m going to briefly summarize it for you.

One day, little eight-year-old kid Billy Bixby is waking up and he sees at the bottom of his bed a little dragon.

And this dragon is about the size of a kitten.

And Billy walks over to it, starts petting it and the dragon purrs.

He thinks, “Hey, this is awesome, I have a dragon.”

He’s like, “I got to show my mom.”

So he walks downstairs, he’s going to breakfast, the dragon is following after him.

He sits down at the table; the dragon sits down at the table;

his mom is busy making some pancakes.

And Billy says, “Hey Ma, check out my dragon.”

His mom looks at Billy and says, “Billy, there’s no such thing as a dragon.”

Now mind you, Mom is looking right at the dragon,

and she sees when the dragon starts to grow because she doesn’t acknowledge him and it eats all of Billy’s pancakes,

so she has to make some more.

And this is basically repeated throughout the book.

Billy says, “Here’s a dragon.”

Mom says, “There’s no such thing.”

And the dragon grows, and gets bigger and bigger.

Until at the end of the book, he’s wearing the house like a hat.

And it’s only at that point when Mom and Dad say, “Hey, you got a dragon here, Billy.”

He says, “Yeah,”

and then the dragon begins to shrink, as soon as they acknowledge the problem.

Why am I telling you this and why did Jack Kent tell all little kids this story?

Well, let me ask you something, how many of you have got a dragon hiding maybe in your personal life somewhere?

It’s that room you haven’t cleaned, or that pile of papers that you got from the mail that you haven’t looked through, maybe it’s something in a personal relationship. Or maybe, there’s something performance or capacity related that you haven’t dealt with, that you’ve put off for just too long.

And as performance and capacity guys, very often, we’re trying to tell you about dragons, about problems that are going to grow to become way worse if you don’t deal with them now.

And unfortunately, most of the time, all people want to know is, “How big should I make the house?”

If you can’t see the problem, you can’t solve the problem

So, this is one of my favorite quotes by Marcus Aurelius, “The impediment to action advances action. What stands in the way becomes the way.”

And in my short time at Zendesk I’ve given quite a few talks about the philosophy of capacity planning and I always like to talk about this.

Because again, when you’re confronted with problems that seem extraordinary, you really have to, at least once you acknowledge them, you know that you need to do something about it.

And this is where, for us, Datadog APM has come in in a huge way. Because with that we’re able to visually show people that, “Look, here’s a problem, here’s something that we need to deal with.”

And everybody can agree.

It’s the end of the performance guy battling the Dev on the whiteboard and saying, “Hey, you know, we think that you ought to do it this way.”

“No, no, you’re wrong.

We’ve been doing it this way forever and this is how we’re going to do it.”

It’s like, “No, here’s the problem.”

Everybody saw the keynote this morning I hope,

and you can drill right into APM

and say, “No, here’s where you’re taking too long.

Here’s where the problem is.

We know what we need to do now.”

And why does this matter?

If you can’t see the problem, you can’t solve the problem.

The better your visualization is, the better you’re going to be able to identify issues in your infrastructure and solve them.

How is your infrastructure being used?

So where did the dragons live?

They live here.

We all used to have one of these.

It reminds me of a really beautiful looking server room.

Some of us still do have these, although most of us are kind of moving to the cloud.

And the cloud has made a lot of things easier but in some cases, it’s made us a little bit too complacent.

Because we’ll take a look at some systems stats and you know, and we may be beginning to get a little bit too hot on CPU or storage or something like that. And, as if by magic our friends at the cloud providers will gladly spin up some more instances for us, or gladly upscale the ones that we’ve got.

And in fact, one of our cloud providers, and many of you will know exactly who I’m talking about, when you’re interfacing with them they’ve got a great big button on their page, and what does it say?

It says, “Call me.”

And what happens when you click that?

Your phone instantly rings, it’s almost eerie.

And the person on the other side is like, “Hey, we just saw your ticket.

We’re going to help you, it’s going to be wonderful.”

And that’s great, it’s wonderful to have that kind of capacity on demand like we do today.

One of the things that your cloud provider doesn’t necessarily care about is how that provisioned capacity is going to be used. And why should you care about this?

Sometimes it looks like you need more capacity, and sometimes the right thing to do is spin up more instances, or upscale the ones that you have.

But sometimes, you could go a different way. You could use the capacity that the dragon is taking up, capacity that is being stolen from your infrastructure.

What if your system stopped wasting resources?

And let me ask you, how different would your capacity picture be if you stopped wasting resources?

Who here knows what minerd is?

I know more of you know what it is than you think.

Minerd is a Bitcoin mining client, it’s extremely resource-intensive.

I know a lot of people, I’m sure you may too, know someone who thought it might be a great idea to stress-test your infrastructure by running minerd.

It sucks up all your CPU.

And you know why you’re laughing?

You’re laughing because you or someone you know put minerd on something and might have gotten in a little bit of trouble. Because you’re sitting there mining bitcoin on it and you thought no one would notice and this would be a great way, because you deserve a bonus after all.

Anyway, fortunately for most of us, we all know that unless you’re running it on purpose-built hardware anymore you’re not going to make enough money to cover the cost of electricity, which is why you were running it with someone else’s electricity in the first place.

Anyway, we can all agree that running something like this is a total waste of resources, not something we want to do.

It might even be theft.

But, something like this, it’s a waste and we would agree we wouldn’t want this sort of dragon hanging around at all. Easy to spot, easy to get rid of.

We uninstall and away we go.

What are you dragging around?

Quick story before we get to the meat here. Quick thing about cars.

People really love them, they’re great for all sorts of things.

Us out in California, many of us are saddled with a very, very long commute in our cars, two hours plus each way.

You probably have the same thing in New York, if you’re stuck in a car, it’s no fun.

So what has that made car manufacturers do?

Make them smaller and smaller, different sorts of engines, hybrid engines, electric engines.

Why do they do that?

So that you can be more efficient, you can get better gas mileage, you can do all kinds of great things.

What do we have here?

We’ve got this little tiny car with a trailer.

It’s perfectly capable of towing, you can tow a trailer with your Prius, it’ll be fine.

But that’s not why you bought a Prius.

You bought a Prius to save on gas,

and if you’re going to voluntarily tow around this trailer, you know you’re not going to get the same kind of gas mileage.

And this is basically what many of us end up doing to our infrastructure.

We’re giving it work that it doesn’t need to do and then wondering why we are running out of gas.

It’s the same thing, the same principle with this car.

Where might your capacity be hiding?

Now, where might your capacity be hiding?

This is the part that everybody’s been waiting for. Where are these dragons? Where is the theft? Where do we start to look?

Not in the car, certainly.

Errors and redirects

Well, the first thing we might want to look at is: What sort of errors and redirects is the system generating?

And we’re going to talk about what kind of things we found at Zendesk,

it was a big part of our capacity recovery story.

Internally, I named this “The War on Errors.”

It sounds very official, sounds like something, you know, any kind of social problem we’ve got in the U.S. we declare war on it. The War on Poverty, the War on Drugs, they all went really well.

This one, I’m happy to report, went much better.

So we declared War on Errors.

Excessive internal calls

The next place that you might want to look for hidden capacity is excessive internal calls.

This is again, where Datadog APM comes in extremely handy because you can look and you can see, “Oh, look.” In our case we found one situation where something was calling the database 27 more times than we thought it ought to.

And we could all agree that that was something that we had to stop and because we had APM we could show people, “Hey, look, look at what your application is doing.”

And nobody knew that this is what was going on.

Turns out somebody had used a RubyGem that, you know, did some things that people didn’t expect.

But you never would have seen it if you didn’t have the right visualization to look at it.
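As a rough illustration of the kind of check described here, the following sketch counts database calls per request trace and flags the traces that exceed a budget. The span layout (dicts with `trace_id` and `type` keys), the function names, and the sample data are all assumptions made for this example; they are not the Datadog APM data model or anything Zendesk is known to run.

```python
from collections import Counter


def db_calls_per_trace(spans):
    """Count database spans in each trace.

    `spans` is a list of dicts with the hypothetical keys `trace_id` and `type`.
    """
    counts = Counter()
    for span in spans:
        if span["type"] == "db":
            counts[span["trace_id"]] += 1
    return counts


def traces_over_budget(spans, budget):
    """Return the IDs of traces whose database call count exceeds `budget`."""
    counts = db_calls_per_trace(spans)
    return sorted(trace_id for trace_id, n in counts.items() if n > budget)


# Made-up sample: trace "a" makes 28 database calls, trace "b" makes one.
sample = [{"trace_id": "a", "type": "db"}] * 28 + [
    {"trace_id": "b", "type": "db"},
    {"trace_id": "b", "type": "web"},
]
print(traces_over_budget(sample, budget=5))
```

The budget is a judgment call per endpoint; the point is only that once calls are counted per request, the outliers stand out.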

Excessive external calls

Another thing we’ll cover later on is excessive external calls.

Now, figuring out rate limits in a business-to-business environment can be kind of tricky.

In business-to-customer, if you have excessive traffic, you label it as abuse, you black hole it, and you go on about your day,

happy that you’ve gotten rid of somebody else trying to DOS you.

But in B2B, we need to be a lot more careful. Because the traffic is, by definition, coming from a customer who’s paying you some money. And the customer is always right.

And so we have to be a little bit more careful.

Cacheables

Fourth thing, cacheables.

And why do we have caching up here?

Caching is obviously a strategy that you should have investigated.

It’s assumed to be going on today.

You might be running Memcached or something like that. But do you understand what your cache hit ratio is?

Have you checked?

Can you really zero in on that?

Do you have any additional static content which ought to be cached and isn’t?

And how do you know if you would be in that situation?
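One way to answer the hit-ratio question, sketched under assumptions: Memcached’s `stats` command reports `get_hits` and `get_misses` counters, and the ratio is hits divided by total lookups. The helper names and the sample text below are made up for illustration; how you fetch the stats text from your own cache is left out.

```python
def parse_stats(text):
    """Parse 'STAT <name> <value>' lines into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def cache_hit_ratio(hits, misses):
    """Fraction of lookups served from the cache (0.0 when there were none)."""
    total = hits + misses
    if total == 0:
        return 0.0
    return hits / total


# Made-up sample of stats output.
sample_stats = "STAT get_hits 900\nSTAT get_misses 100\nEND\n"
stats = parse_stats(sample_stats)
ratio = cache_hit_ratio(int(stats["get_hits"]), int(stats["get_misses"]))
print(f"cache hit ratio: {ratio:.1%}")
```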

Good traffic

And the last one is kind of a surprise.

This is good traffic.

Is it okay to ask yourself this question?

Is it okay if a customer programmatically sends requests?

Or are you expecting traffic only to come from, you know, at the speed that a human might be able to generate it?

If programmatic requests are not okay, how are you detecting it, and what are you going to do about it when you find it?
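A minimal sketch of one way to detect faster-than-human traffic: bucket each account’s requests into fixed time windows and flag any account that exceeds a threshold in a single window. The function name, the tuple layout, the window size, and the threshold are all assumptions for illustration; a real threshold would come from what your own users can plausibly generate by hand.

```python
from collections import defaultdict


def accounts_over_human_rate(requests, window_seconds, max_per_window):
    """Return accounts that exceed `max_per_window` requests in any window.

    `requests` is an iterable of (account_id, timestamp_in_seconds) pairs.
    """
    buckets = defaultdict(int)
    for account_id, ts in requests:
        buckets[(account_id, int(ts // window_seconds))] += 1
    flagged = {acct for (acct, _), n in buckets.items() if n > max_per_window}
    return sorted(flagged)


# Made-up sample: "bot" sends 100 requests in 10 seconds, "human" one every 5 seconds.
sample = [("bot", t * 0.1) for t in range(100)]
sample += [("human", t * 5.0) for t in range(10)]
print(accounts_over_human_rate(sample, window_seconds=10, max_per_window=20))
```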

Problem: HTTP errors at Zendesk

So, HTTP errors at Zendesk.

This is just kind of a small example of what we found when we started looking at our traffic.

Our team is only six months old and we found all kinds of things in the War on Errors.

So, the first thing in Zendesk support.

Roughly speaking, we handle a bit over a billion requests per day, on average.

When we started this program the problem was that, again, roughly speaking, a third of these were errors and redirects.

And this is a big problem.

In our case, it turns out that errors are as expensive or even more so to serve than 200s, okay.

And you might ask, “Dan, that doesn’t seem right.”

And you might have a good point, but remember, one thing we know about errors is we can’t necessarily cache them.

And in our case, a request resulting in an error had to traverse our entire stack in order to determine that the data isn’t there.
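To see what share of your own traffic is errors and redirects, a first pass can be as simple as grouping status codes by class. This is a sketch with made-up sample data and a hypothetical function name, not the query Zendesk ran.

```python
from collections import Counter


def status_class_share(status_codes):
    """Return the fraction of requests in each status class (2xx, 3xx, ...)."""
    counts = Counter(f"{code // 100}xx" for code in status_codes)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Made-up sample: 6 successes, 2 redirects, 1 client error, 1 server error.
sample_codes = [200] * 6 + [304] * 2 + [429] + [500]
shares = status_class_share(sample_codes)
non_2xx = 1 - shares.get("2xx", 0.0)
print(shares, f"non-2xx share: {non_2xx:.0%}")
```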

Now, when you’re starting out a situation like this might be fine, when you’re in startup mode. But when the amount of instances that you deploy has a direct impact on your bottom line, and I think that’s true for everybody sitting in this room, and the bottom line really matters, then you really ought to get this dialed in and having a situation like we have here simply isn’t acceptable.

And it could always be worse.

And for us, it was worse.

And when we were working through this problem at Zendesk, Anatoly said to me one morning, he said, “Dan, we’re to the point now where for every server we install, we’re going to need to install one just to handle errors.”

And this upset us a great deal.

We were, you know, first of all we’re sitting there going, “Is this right? Hopefully, this isn’t true.”

Dragon number one: 429s

Now, since Zendesk has all kinds of customers, and all kinds of ways for traffic to enter our system, we needed a way to determine who it was that was causing these and where it was coming from.

And we had to pick someplace to start.

And the first place we started was with error 429.

And for those of you not familiar with what 429 is, this is basically the code you get when you’re sending too many requests.

And this was the first and most obvious dragon that we were encountering and trying to convince people that it was there.

And the user sent too many requests for a given amount of time.

Again, if you were part of the consumer internet, you’d black hole it, you wouldn’t care. B2B, different approach.

How big of a problem are we talking about?

And this is what we dubbed upside-down customers.

I know the Datadog graph is a little pale, but you can see the numbers. In this case, our upside-down customer had four million 200s a day, four million good requests. But they had 15.5 million 429s.
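The upside-down test can be written down directly: an account is upside-down when its 429 count exceeds its 200 count. The sketch below uses the numbers quoted above for one account and an invented second account for contrast; the account IDs, the function name, and the dict layout are assumptions for illustration.

```python
def upside_down_accounts(counts_by_account):
    """Return (account_id, ratio) pairs for accounts with more 429s than 200s.

    `counts_by_account` maps a hypothetical account ID to {status_code: count}.
    The ratio is 429s per 200, sorted with the worst offender first.
    """
    flagged = []
    for account_id, counts in counts_by_account.items():
        ok = counts.get(200, 0)
        throttled = counts.get(429, 0)
        if throttled > ok:
            ratio = throttled / ok if ok else float("inf")
            flagged.append((account_id, ratio))
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


sample = {
    "account-1": {200: 4_000_000, 429: 15_500_000},  # figures from the talk
    "account-2": {200: 1_000_000, 429: 2_000},       # invented for contrast
}
print(upside_down_accounts(sample))
```

With these figures, the flagged account is rejected almost four times for every request that succeeds.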

And Anatoly said to me, “Hm, people are asking us to give them capacity.

Maybe we ought to ask customers to respect HTTP 429 Retry-After headers before we start spinning up anything else.”

And I said, “Hey, is this really a big problem?”

And he said, “Yes, 429s are crazy expensive.”

Again, you can’t cache them.

In our case, the data of Customer A, “Where’s your rate limit?” is way down the stack.

So again, you’re traversing the entire tech stack in order to find out what your rate limit is in order to tell people that they had exceeded it.

And that’s not good.
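On the client side, respecting a 429 means waiting for the period given in the Retry-After header before trying again. Here is a minimal sketch of that behavior, assuming a hypothetical `send` callable that returns a `(status, headers)` pair and an injectable `sleep`. Retry-After may carry either a number of seconds or an HTTP date, so the helper handles both.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay_seconds(retry_after, now=None):
    """Translate a Retry-After header value into seconds to wait."""
    value = retry_after.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    now = now or datetime.now(timezone.utc)
    when = parsedate_to_datetime(value)  # HTTP-date form
    return max(0.0, (when - now).total_seconds())


def send_with_retry(send, sleep, max_attempts=3):
    """Call `send` until it stops returning 429 or attempts run out.

    `send` is a hypothetical callable returning (status_code, headers_dict).
    """
    status = 429
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 429:
            break
        if attempt < max_attempts - 1:
            # Default of 1 second when the header is missing is an assumption.
            sleep(retry_delay_seconds(headers.get("Retry-After", "1")))
    return status
```

In real use, `sleep` would be `time.sleep` and `send` would wrap whatever HTTP client the integration already uses.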

So, how do we solve this problem?

And it’s a major one.

And you’re going to be very displeased about how we solved this problem.

What we had to do is we had to reach out to the customer.

And this is something, as technical people we’re not, it’s not a solution we’re often comfortable with.

But in our case, it was the most expedient one in order to deal with what was going on.

We have a great Customer Success Team and we had them reach out to the customer. Because of course, you don’t want engineers calling up the customer and saying, “Stop that.”

They did it very nicely.

And, yes, of course, we know this won’t scale.

But again, it was very, very important.

Now, here’s the cool part because you’re asking yourself, “Well, how did you find the accounts that were creating that kind of traffic?”

And, Mr. Anatoly has made a little animation that’s going to kind of walk you through how we used Datadog APM to find customers that were either good or were sending far too many 429s.

Take it away, Anatoly.

Identifying 429s

Anatoly: Hi, everyone.

At Zendesk we use Trace Search every single day, and it’s such an amazing tool, we love it. Everyone at Zendesk loves it.

Why do we love it?

We stop talking about the problems, we start showing the problems.

And to show the problem, the best thing is a screenshot or an animated GIF.

We’ve prepared three animated GIFs for you.

And we will show how to use APM at Zendesk.

Let’s take two accounts.

You will see some animated GIFs, each 40 seconds long. You have Account ID 1 and Account ID 2.

And each account will generate its own traffic.

So, animation begins.

As I said, Account ID 1 and Account ID 2.

If you see at the bottom, there’s a graph on Account ID 1 and Account ID 2, bottom is slightly different.

Now, we’ll show you the traffic by status code.

Yeah, bottom is slightly different.

In a second you’ll see that Account ID 1 generates specific errors all the time.

We break down by status code.

And you see, it’s very similar to the graph we saw before.

And now we’ll just show you that you can tap Account ID and you can specify and see that Account ID 1 generates just errors all the time, nothing else.

Where Account ID 2 generates only success responses.

And APM Trace Search is super powerful to show you that just in 40 seconds.

The problem is, sometimes the duration of the 429 can be as expensive as the 200,

and you probably want to go and check in your system.

This is a trial account, test account, with some sample data,

but you may be surprised at your system, and it may cost the same as a 200.

Dan?

Dragon number two: 304s

Daniel: Great.

So, the next dragon was 304s.

And just, in case you’re following along at home, anything in the 300 range is a redirection.

The real errors on the client side are in the 400 range and anything on the server side is going to give you a 500.

If a client gets a 304, not modified, then it really should be the client’s responsibility to display whatever it needed from its own cache.

How big a problem was this?

Well, at the time we were making this presentation, again, the amount of 304 errors we were seeing in the infrastructure, it was very bad.

It’s gotten better.

And no one can disagree that when you see this many errors over the course of a day, that you have problems—when it becomes one third of your traffic.

Let’s think about what’s happening on a customer’s side when they get a 304.

There’s absolutely no benefit for them at all,

they’re not going to change anything.

The data they have is not stale, but we spend loads of computational resources to give them a response and let them know that it’s not stale.
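A sketch of why a 304 can cost as much as a 200, assuming a common implementation pattern in which the server derives an ETag from the fully rendered body. The `respond` function and its parameters are hypothetical and not a description of Zendesk’s stack; the point is that the expensive work runs before the server can decide the client’s copy is still good.

```python
import hashlib


def respond(render, if_none_match=None):
    """Return (status, headers, body) for a conditional GET.

    `render` is a hypothetical callable that does the expensive work
    (database queries, serialization) and returns the body as bytes.
    """
    body = render()  # the full cost is paid here, before any comparison
    etag = '"' + hashlib.sha256(body).hexdigest() + '"'
    if if_none_match == etag:
        return 304, {"ETag": etag}, b""  # nothing sent, but nothing saved either
    return 200, {"ETag": etag}, body


render_calls = []


def expensive_render():
    render_calls.append(1)  # stands in for traversing the whole stack
    return b'{"tickets": []}'


status, headers, _ = respond(expensive_render)
status_again, _, body_again = respond(expensive_render, headers["ETag"])
print(status, status_again, len(render_calls))
```

Both requests trigger a full render; only the bytes on the wire differ.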

So, what we did, and that we would recommend everybody taking a look at, is we implemented HTTP client caching, which is basically we put a cache-control line in the header.

And in this case we set it to max-age=10 to instruct the browser not to reach us for the next 10 seconds, because with Trace Search what we found is that some of our API endpoints were updating less frequently than every 10 seconds.

So there was really no need to update any faster than that.

It really didn’t require any additional changes on the client side.

We just had to instruct the browser not to hammer us as frequently as they intended to.
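As a minimal sketch of what adding the header can look like on the server, here is a small WSGI wrapper that sets `Cache-Control: max-age=10` on successful responses. The wrapper name, the demo app, and the choice to touch only 200 responses are assumptions for this example; Zendesk’s actual implementation is not shown in the talk.

```python
def with_max_age(app, seconds):
    """Wrap a WSGI app so its 200 responses carry Cache-Control: max-age."""

    def wrapped(environ, start_response):
        def start(status, headers, exc_info=None):
            if status.startswith("200"):
                # Replace any existing Cache-Control header with ours.
                headers = [(k, v) for k, v in headers if k.lower() != "cache-control"]
                headers.append(("Cache-Control", f"max-age={seconds}"))
            return start_response(status, headers, exc_info)

        return app(environ, start)

    return wrapped


def demo_app(environ, start_response):
    """Hypothetical endpoint whose data changes less often than every 10 seconds."""
    start_response("200 OK", [("Content-Type", "application/json")])
    return [b"{}"]


cached_app = with_max_age(demo_app, 10)
```

Because the change lives entirely in the response headers, clients that honor HTTP caching need no code changes.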

Now, the way we handle 304s turns out to be very similar to the way we handle one more dragon.

But before we get to that, we’re going to have Anatoly show you how Trace Search helped identify some of these things.

And, in this one we’re really looking at our different end points that we had.

Because traffic can enter Zendesk through either the web interface, you know, you can imagine you are creating a ticket, or it can enter in through an API.

And so being able to identify where your endpoints are and figure out what’s causing problems is really important to us.

So I’ll let Anatoly take you through that.

Identifying 304s

Anatoly: Thank you, Dan.

I’m going to show you, the traffic comes from the same account,

there is no account ID,

but you may see the status code is a 304 not modified or 200 OK.

And I’ll take a look before … I saw the animation of the request rate over there.

How many requests do you see here?

Twenty thousand requests come from an API, and only 2,000 requests come from the web browser.

We have two sources of the traffic, as we said.

And on the animation, we will show you just 304s.

The traffic pattern is kind of not heavy, but, if you compare them both on the same screen, if you group by account ID, there you’ll see a massive difference.

Again the web, the same as the previous graphs.

So just do group by …

now we group by the status code.

Now you see the problem, 304 is so much different though compared to 200s.

And this issue we see in our system all the time pretty much every single day.

The duration can be the same, 304s or 200s.

We spend the same amount of time to compute the response,

but for 304s, customers get absolutely no benefit.

And APM search, APM Trace Search can show the exact URL where we spend this time.

Yeah.

The last dragon: 200s

Daniel: All right.

So, we had one more dragon and it was kind of a surprising one.

And this was actually, you know, it’s not technically a problem—but it can be excessive.

And you should take the time to understand when normal OK requests are a problem and when they’re not.

Here are some things that, you know, Anatoly and I considered when evaluating good traffic at Zendesk.

The first is that our capacity is being built and will continue to be built to support traffic that a human being can generate.

That said, we had to proactively reach out to customers whose traffic profile could be characterized as excessive or anomalous, and work with them to take action.

We also have many publicly available unauthenticated API endpoints,

and these should not touch our infrastructure directly.

Ideally, one should build as many guards as possible: circuit breakers, backup mechanisms on every layer.

And we would recommend that you use HTTP-specific caching in order to instruct the client’s browser, or something like Cloudflare, not to hit our infrastructure excessively.

Because remember, Cloudflare won’t cache until we instruct our application to insert a cache-control line into the headers.

Without that, every Cloudflare request is going to come through and hit your infrastructure.

And just to take everybody through exactly what’s going on when that happens, your problem statement is, you know, there’s two ways that you can go about doing this.

You can either strive to reduce your throughput, or improve your response time; and usually what you’re looking to do when you’re imposing a solution like this is to reduce your throughput.

And this is the state that you find your system in when you don’t have any cache going on.

Your browser just goes right through to the underlying infrastructure, our infrastructure returns a code 200, returns all the data that it needed to, and you go on your merry way.

Here’s a little bit better way: set your cache-control: max-age=5, or whatever you think appropriate, with private or public. Private lets only the client’s own browser cache your backend API response; public also lets an intermediate proxy like Cloudflare cache it.

This tells the browser to turn on the cache, even if it’s running some kind of application that, for whatever reason, wants to ignore it.

In this case, you don’t have to keep serving the data once it’s been served once. And at the end, this is what it looks like: all your subsequent requests are going to get served out of the cache until the data is deemed to be stale, that is, until the max-age in your cache-control header has been exceeded.
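The effect on throughput can be sketched with a small simulation: a client that polls once per second, with and without a five-second max-age. The `MaxAgeCache` class and `origin_hits` function are hypothetical names for this illustration, and the clock is simulated so the example runs instantly.

```python
class MaxAgeCache:
    """Tiny single-entry cache that honors a max-age, as a browser would."""

    def __init__(self, fetch, max_age, clock):
        self.fetch = fetch        # callable that hits the origin
        self.max_age = max_age    # freshness lifetime in seconds
        self.clock = clock        # callable returning the current time in seconds
        self.value = None
        self.stored_at = None

    def get(self):
        now = self.clock()
        stale = self.stored_at is None or now - self.stored_at >= self.max_age
        if stale:
            self.value = self.fetch()
            self.stored_at = now
        return self.value


def origin_hits(poll_seconds, max_age):
    """Poll once per second for `poll_seconds` and count origin requests."""
    hits = 0
    current = 0

    def fetch():
        nonlocal hits
        hits += 1
        return "payload"

    cache = MaxAgeCache(fetch, max_age, clock=lambda: current)
    for current in range(poll_seconds):
        cache.get()
    return hits


print(origin_hits(60, max_age=0), origin_hits(60, max_age=5))
```

Over one simulated minute, the uncached client reaches the origin 60 times and the cached one 12 times, which is the kind of throughput reduction described here.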

And to those who say, “Well, sure, that’s something that we obviously want to do.

Everybody wants to do something like this.”

Basically, remember, that it’s all the same bunch of people trying to solve the same sorts of problems.

In our case, until we started digging with Datadog Trace Search and APM to understand that we had excessive 304s and excessive 200s, we would never have known that we could have done this.

We never would have understood that there was capacity that we could have recovered, because we didn’t see it.

And what was the end result here?

Here’s one customer who is related to the sport of auto racing, Formula 1 in this case, I don’t know who exactly the customer is.

But you can see that every weekend when there was a race, there was a huge traffic spike.

And they had on their website, something that would hit endpoints and, you know, it was getting fine responses, it wasn’t excessive, it was obeying everything.

But it simply didn’t need to be served.

And so we set about trying to figure a way to do exactly what we described, to cut out some of the excessive 200s and serve more things from the cache.

And you can see the difference that it made week over week.

The first week you had, you know, upwards of …

what does it say here?

It says it had 400,000 requests,

and the subsequent week it was a fraction of that.

Cutting off traffic

And so, how do you do that?

And what is Datadog going to do?

The other two animations that Anatoly showed you, you were trying to discover things about what’s going on in your infrastructure.

In this case, this is something that we used APM to verify that the caching mechanisms we put into place were actually giving us the benefit that we hoped.

Because again, it’s fine to talk about these things, but you have to verify that what you did worked.

So let’s let Anatoly take you through that.

Anatoly: Thank you, Dan.

This is the most exciting animation we have ever created for the presentation.

And take a look at the sidebar before the animation begins.

This is the HTTP cache-control, this is basically the header, we instruct on the backend side, just to respond back to the client.

And the cache-control header can be either zero or five, in that case.

This is the sample data,

not associated to the real response.

It can be either zero or five.

So, this is the traffic.

This is the 15 minutes of time, this is the five minutes of the traffic and then 10 minutes of the traffic which was cached and throughput was reduced.

If you see the value zero or five, this is the value we showed before on the left sidebar.

In the Trace Search you can search the HTTP cache-control, max-age:0.

This is the value we set on the backend.

And now we can say, show me five or zero.

But then we can compare the time it takes to generate the response, too.

And the time is the same as you expected; same request, same response.

But we care the most about the throughput.

Can we cut off 50% of traffic, 70% of traffic?

And that’s exactly what we implemented and we’ve found the human capacity analysis.

Takeaway

Daniel: All right.

So what do we hope you come away with?

As performance and capacity guys, like I said, we care about how things are being utilized.

And we don’t just care about our systems.

Believe it or not, we care about your systems, too.

Problems like the ones we talked about, are the things that keep us awake at night.

And I bet they keep some of you awake at night as well.

And we wanted to ask you and remind you: how many of you know that you have a dragon or two lurking in your infrastructure?

Would you know if you did?

And what are you going to do about it when you find it?

Now you’ve seen that you have more options than just turning the capacity knob at your favorite cloud provider.

Do yourself a favor and take a look at what your systems are being asked to do and what they’re serving.

Our team has existed for six months, and we continue to find opportunities to clean things up and free up excess capacity.

What are you going to find?

How much money could you save if you start paying attention to the dragons in your infrastructure, instead of just simply trying to figure out how big you need to make the house?

Thank you very much.

Questions:

Audience: Have you been able to use Datadog to find bad programming from your side?

Daniel: I would say,

Anatoly, you have a good answer for this?

Well, the nice thing that you can use with Trace Search is you can find where you are spending extraneous time. You can find, you know, and unfortunately we didn’t prepare something that traces through an entire problem, but that is one of the really powerful features is, you can see …

and like, for instance, the example that I gave you where we found that there was one RubyGem we have that was making about 27 calls to the database and nobody could understand why.

And so that was one of those sorts of things that we found that we could,

I would call bad programming that you might want to rethink your approach and do something a little bit different.

Audience: So, my question is, how much time do you spend trying to identify when a HTTP 200 was a good one or was a bad one?

Daniel: How much time do we put in?

Audience: Yeah, I mean, how much do you have to go into the data using the Trace Search solution from Datadog?

Daniel: Once you get used to Datadog APM, it feels like it goes very quickly.

This started, Anatoly correct me if I’m wrong, as a suspicion that Anatoly had.

And he had been talking to me about, “I think a lot of this traffic is, you know, extraneous stuff that we don’t need to serve.

Let’s see if we can cut down the amount of good traffic that’s there, because I think we’re just serving the same thing over and over and over again.

And it seems like we have some mistuned caching.”

And it was, you know, just looking in APM, it’s like it can give you what the volume is.

And then, you know, “All right, let’s try this caching solution and see if that made a difference.”

And in our case it absolutely did.

So that’s why we really enjoyed or thought that that was a fabulous tool for testing the assumptions that you make and making sure that the solutions you put in actually work.

Audience: Hey, there.

Since you’re doing drill down by accounts, I was wondering if you could speak to on the order of how many accounts you’re looking at. And more generally, have you run into any performance limitations on the volume of tags?

Daniel: The volume of tags.

The way our data is partitioned, kind of precludes that being a problem because it’s distributed both geo and separated by our definition of what a pod is.

I don’t know if we’ve had any problems with the amount of tags.

Can you think of anything that’s been an issue necessarily?

Anatoly: Yeah.

The APM backend is backed by Elasticsearch, and Elasticsearch gives you pretty much unlimited cardinality, like, an unlimited amount of unique tags.

And we didn’t find any performance issues with pulling like 200,000 values. We didn’t see any problems yet.

So maybe we’ll see them one day, but not now.

Daniel: Not so far.

Anatoly: Not at Zendesk scale, yeah.

No problems.

Daniel: All right.

Well, we’ll be around in case anybody has anything else they want to talk to us about.

And thank you everybody for coming,

and enjoy the rest of the day.
