Stress Testing in Production: The New York Times Engineering Survival Guide

Published: July 17, 2019


Who are we?

So The New York Times was founded in 1851 with a very simple mission: “We want to seek the truth and help people understand the world.”

Most of you in this room are aware of The New York Times in the news. We’ve been in this industry for more than 100 years.

But many of you may not be aware of the scale of our digital footprint.

What do I mean by that?

Challenges and Opportunities for The New York Times

In any given month, we have 150 million users visiting our applications across platforms like desktop and mobile.

Not only that, but as of now, we have more than 3 million paid news subscriptions, making it the most successful news subscription business not only in the U.S., but in the whole world.

On any given day, we publish around 250 original pieces of journalism, and that’s great work that our newsroom is pushing towards.

Let’s switch gears and talk about something interesting to all of us.

Elections.

What are elections to The New York Times?

Elections, to The New York Times, are like Thanksgiving to most retailers.

For The New York Times, it’s the most highly anticipated traffic event that we ever experience in the…

And you can see from the graph.

The purple line…

It looks kind of black on the screen, but this lower line is our normal average traffic.

And as you can see, the orange one, that’s the traffic that we see on the election results day, and that’s a huge spike.

So, preparing for this kind of traffic is imperative for us.

Last year, midterm elections were approaching.

And our goal for the midterm elections was, “We want The New York Times to be the key destination for our users, with stable performance and up-to-date key election results.”

Traditionally, midterm elections are not a big deal for The New York Times in terms of traffic, but considering the political climate that we are seeing, the projections for the traffic were skyrocketing.

And on top of that, we have some challenges that will make it a little harder for us to achieve the goal.

The number one challenge we have is, since the election of 2016, we completely changed our infrastructure.

We are now in the cloud.

In 2017, every single component of The New York Times is in the cloud.

And not only that, but we also completely changed how the development teams approach the infrastructure.

Now, each team is responsible for deployment as well as for incident management for their own application.

On top of that, we now have more subscribers than we ever had.

And compared to the 2016 election, we have 2X subscription growth.

As you can see from the chart on the right of the screen, it just keeps growing.

So now we have more users, a completely changed infrastructure, and the political climate, and we were just like, “Oh shit, we are in trouble.

So, we have to work on it.”

Preparing for the 2018 midterm elections

To work on that, we created a project called Operation Election Readiness.

The goal of this project, the mission of this project was to organize a massive cross-team effort.

We are talking about more than 20 teams and different functions, technology, newsroom, customer service, that all have to work together to make the goal achievable.

The structure of the project was something like this: we had an election leads committee with representatives from a bunch of functions, like engineering and program management.

And they were getting constant input.

If you see the right side on the screen, delivery engineering is my team.

So, they were getting constant input from us and the stakeholders, in consultation, on “How do we need to prepare for it?”

We, along with the election leads, were guiding the development teams on how to prepare for it.

Clicker works.

For today’s topic, we’re gonna focus on my team, which is delivery engineering.

Our team was given three main tasks.

One, we wanted to assess each team’s architecture through architecture reviews.

As I mentioned, now every team is responsible for their own development, like deployment and taking care of the infrastructure, etc.

We wanted to make sure that they are following the best practices.

We started by assessing them, asking questions like “What level of maturity are they at?”

And we started assigning some tasks to help them prepare.

Second, we conducted a stress test to make sure that we are scalable.

How many of you in the room have heard of or know about stress tests?

Wow, that’s quite a…

Okay.

It makes my life easier, so we’ll go through some slides faster.

So, for the rest of the session, we’re gonna focus on the stress testing and how we utilize that to prepare for the elections.

Third, despite how hard we try, we’ll still have production issues.

Can anyone in this room say that their system is completely bullet-proof?

Right. If someone is raising their hand, I need to talk to you.

So, no matter how hard we try, we will still have issues in the production.

But our goal was, whenever we have these issues, we want to minimize that.

So, we started incident management training, and we trained more than 100, maybe 150, developers over a month on how to debug if anything happens.

We created a process, and we did all this kind of stuff.

It was cool. As I mentioned, for the rest of the talk, we’re gonna focus on the stress testing.

What is stress testing?

Stress testing is when you test the system’s capacity by throwing a large amount of virtual traffic at it.

In very simple terms, stress testing is trying to break your site, and learning and identifying the bottlenecks that will not allow you to scale.

Did anyone attend the SLI workshop yesterday?

A few of you.

There was a nice point there that one of the SLIs is saturation.

Let’s say your system is able to handle 100 RPS. How much further can it stretch?

And you can also use stress testing to identify that SLI for you.
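To make the saturation idea concrete, here is a minimal sketch of how stepped stress test results could be turned into a saturation figure. It is not the method used in the talk. The `StepResult` fields, the thresholds, and the sample numbers are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    rps: int           # offered load during this step
    error_rate: float  # fraction of failed requests, 0.0 to 1.0
    p95_ms: float      # 95th percentile response time in milliseconds


def saturation_point(steps, max_error_rate=0.01, max_p95_ms=1000.0):
    """Return the highest RPS reached before the first unhealthy step, or None."""
    last_healthy = None
    for step in sorted(steps, key=lambda s: s.rps):
        # Thresholds are illustrative assumptions, not values from the talk.
        if step.error_rate > max_error_rate or step.p95_ms > max_p95_ms:
            break
        last_healthy = step.rps
    return last_healthy


# Made-up results from four load steps.
results = [
    StepResult(rps=100, error_rate=0.0, p95_ms=180.0),
    StepResult(rps=200, error_rate=0.001, p95_ms=240.0),
    StepResult(rps=400, error_rate=0.002, p95_ms=650.0),
    StepResult(rps=800, error_rate=0.35, p95_ms=4200.0),
]
```

With these made-up numbers, a system that normally serves 100 RPS stays healthy up to 400 RPS, so it has roughly four times its normal load in headroom before it saturates.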

For the stress testing, the second thing that will come to your mind is, “What is the difference between stress testing and load testing?”

The biggest difference is, in the load testing, your focus is to analyze the system behavior under expected load.

What I mean is, you know that your system is going to get, let’s say, 500 RPS, so you simulate 500 RPS and then you analyze your system health, like how the CPU is doing and how the resources are being used.

Mostly, it’s used to determine throughput.

Like what is the maximum capacity of my application?

It’s also used to identify the resources needed.

We are in the cloud environment, everyone is aware of the cloud cost.

So, you can also use this kind of load testing to work out how many cloud resources you actually need.

And compared to that, stress testing, as I mentioned, is more focused on identifying the bottlenecks.

Like which part of your microservices, or which part of your application will start breaking first.

Or what are the things that you will analyze if you’re going over the expected limits of your system?

Mostly, it’s used to prepare for high-traffic events such as elections, Thanksgiving, or tickets going on sale for a major concert.

And you can also do this regular exercise to actually analyze the maturity of your own application.

Key objectives for stress testing

So, what are the key objectives for the stress testing that we wanted to do?

The number one objective was, we want to verify that the system that we have is scalable.

So, what that means is we wanted to test that it will be able to handle the traffic.

And not only gradual traffic growth, because sometimes we also have sudden peaks during the election, as you can see from the graph.

So, we wanted to simulate all those kind of scenarios to make sure that we are prepared for it.

Oops.

Second, not only that, we wanted to make sure our site is performant.

What I mean is, say you go to The New York Times on election results day and the site takes three to four seconds to load.

How many of you will still wait for the whole website to load to look at the results?

If your site takes three to four seconds to load on that day, it’s almost like serving 500 errors.

Because users will just go to another site and get that information.

So along with scalability, performance was also very important for us.

The most important part, as I mentioned, we want the election results to be up-to-date.

What I mean here is that the newsroom will keep pushing new election result updates as soon as possible.

And we wanted to make sure that those publishes show up on the site quickly.

In very simple terms, we wanted to avoid stale content.

Yes, we have caching, but who wants to see a two-hour-old cached page on election results day?

It has almost no value at all.

Lastly, since we are going for stress testing and it has the ability of breaking the system, we wanted to take this opportunity to exercise the resiliency plan that we have put into place.

Like verifying the incident management process, how the system is recovering, what are the logs that we have to take care of, and things like that.

For the next four sections, I’m gonna talk about how we planned for it, how we prepared for it,

the most fun part, how we executed it, and what the results of the stress test were.

So, let’s start with the planning.

Planning for a stress test

When you plan for such a massive test, the first thing that comes to your mind is “What tooling will we use?

How much power do we need to simulate the amount of traffic that we want?”

So, we decided to go with JMeter, which is an open-source tool for scripting load tests.

It’s a JVM-based tool that speaks HTTP, and it’s very powerful.

But JMeter itself has a limitation, in that it needs a platform to run on.

And you can’t run JMeter from your laptop to simulate the amount of traffic that we wanted.

So, we partnered with BlazeMeter.

BlazeMeter is a cloud-based load generation provider that takes the JMeter script, spins up a bunch of load generators in the cloud, and aggregates the reports for you.

So, it makes our life easier.

You can also do this kind of thing without BlazeMeter; it just takes a lot of work to spin up a bunch of pods or servers and configure them to collect all the reports.

The second thing is, which environment do we want to run this test in? Which environment do we want to break?

And I just realized that you already got the answer from the title.

We wanted to run this test on an environment that is stable and that mirrors production.

How many of you can confidently say that your staging environment is as stable as…or as much of a mirror as production?

None of you.

We are in the same boat.

I mean, sometimes it doesn’t make sense, but also.

So, we wanted to have data and an environment which is like production.

And lastly, we want people to take this test seriously.

So, we got buy-in from our senior leadership, and thank you for that.

Our newsroom folks were brave enough to say, “All right, let’s do it. Let’s break the production New York Times website and see when we fail.”

As I mentioned, since we are going to break the production website, preparation was very imperative for us.

And as I mentioned before, we had more than 20 teams that we were working with.

So, our program managers and we worked really hard on a lot of internal coordination, deciding things like: What time do we want to run the test?

Which days will not have any planned news cycle?

Which teams do we need to keep in the loop in terms of communication?

Not only internal, but external coordination is equally important.

Nowadays, our applications rely on a lot of partners, for cloud or for monitoring.

So, we wanted to take this opportunity to have them participate in the test with us, and to strengthen our partnership.

We invited them to participate in the test and get a feel for what an actual election results day will be like, so if anything happens, they know exactly who to contact and how to solve it together.

Again, since we are going to break our production website, a failover plan was needed.

Like what if the production website is completely down? What if you open your iOS app and you can’t see the news?

So, we created a bunch of scenarios, like, “What do we do if this happens?

What do we do if that happens?”

We went deep enough into the preparation to make sure that customer service was covered too.

Like, what do they say if we go down?

Learning review.

We don’t just want to run the test and forget about it.

We want to learn from the test.

So we created a structure in which someone logs the timeline while the test is executing.

We created a process for how that timeline will be consumed after the test is over.

We will conduct a learning review as well as a blameless postmortem, and have some lessons learned from it.

Like what are the action items that we have to do?

Since we talked about planning, let’s talk about preparing for the test.

The first thing we did is, we started understanding each application in and out.

So, what that means is, what is the business purpose as well as the technical purpose of the system?

What are the internal and external dependencies?

What is the role of caching?

Caching plays a very important role in load testing.

So, we wanted to understand how the caches are set up, because learning that helps us design the test accordingly.

Throughout the next four sections, I have a small section called ‘Tips,’ which will have things that we learned or we did take care of.

The first thing I want to clarify is, don’t try to stress test your website through a content delivery network.

You will not have enough power at all, and it doesn’t add any value for you.

Second, unless you have a strong reason, don’t hit the cached endpoints; hit the host directly.

That’s where you will learn the most.

Scripting your stress test

Second, script.

JMeter, like most of the load testing tools out there, is not a browser.

What I mean is, when you go to newyorktimes.com in a browser, it parses the page and automatically makes a bunch of calls that are in your JavaScript.

But JMeter doesn’t do that.

So, you have to design a script in a way that mimics your application behavior.

Like a user interacting with the site.

So, we started recording actual traffic to work out, “Okay, if you’re going to The New York Times, how many times is it calling our user info API and all those other APIs?”

And we started designing the test that will mimic an actual user.
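As a rough illustration of what mimicking an actual user can mean in a script, the sketch below expands page views into the backend calls a browser would make on its own. The paths and call counts are hypothetical placeholders, not real New York Times endpoints or the recorded ratios from the talk.

```python
# Hypothetical call pattern: how many requests one page view triggers per path.
# A real plan would come from recorded production traffic.
CALLS_PER_PAGE_VIEW = {
    "/": 1,                      # the page itself
    "/api/user-info": 1,         # hypothetical user info call
    "/api/election-results": 3,  # hypothetical data calls made by page JavaScript
}


def build_session(page_views, calls_per_view=None):
    """Expand a number of page views into the ordered list of requests to send."""
    plan = CALLS_PER_PAGE_VIEW if calls_per_view is None else calls_per_view
    requests = []
    for _ in range(page_views):
        for path, count in plan.items():
            requests.extend([path] * count)
    return requests


session = build_session(page_views=2)
```

Keeping the ratios in one table means the page request and its dependent API calls scale together, which is the property a non-browser tool does not give you for free.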

We also worked on creating the test data.

Since we are going into production, we wanted to make sure that the traffic that we are simulating is as production-like as possible.

So, we created a bunch of cookies and headers for it.

The third point is very important.

As I mentioned, that we wanted to make sure that our site on that…

Like, one of the key objectives for the stress testing is to make sure that the results are accurate.

So, how to do that?

We created a bunch of scripts that mimic newsroom activity (pushing new updates at the same time).

So, we are throwing the traffic from the user perspective, and then we are throwing the traffic from the newsroom perspective where we are busting the cache at the same time.

We wanted to stress the system from both ends.

Minor tip: make sure you send the full HTTP request whenever you do this kind of testing, otherwise a bunch of cloud providers will start calling you because they think that you are running some kind of SYN attack.

So, don’t do that.

Make sure you send the full HTTP request.

Designing the stress test

Third, designing was very important for us.

As I mentioned with the graph before, we have seen the traffic pattern.

So, we analyzed the past election results traffic as well as any peak traffic that we have seen.

And we designed the test accordingly.

Like what level we want to reach at what time…

And we’re gonna talk more about that in the execution phase.

We also exercised a bunch of location-based scenarios.

What I mean is, many of your applications may be in the cloud, and depending on where the traffic is originating, different servers might be getting the hits.

So, we exercised that too. For example, if in real life 30% of the traffic is coming from the east, let’s throw 30% of the traffic from the east and 70% from the west.

So, we tried to match production traffic as closely as possible.
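The regional split described above can be sketched as a small allocation helper. This is only an illustration: the region names and the 30/70 split come from the example in the talk, while the function name and the total load are assumptions.

```python
def split_by_region(total_rps, percent_by_region):
    """Split a total request rate across regions by whole-number percentages.

    Uses integer arithmetic so the parts always add up to total_rps; any
    leftover requests go to the regions with the largest remainders.
    """
    if sum(percent_by_region.values()) != 100:
        raise ValueError("percentages must add up to 100")
    allocation = {r: total_rps * p // 100 for r, p in percent_by_region.items()}
    remainders = {r: total_rps * p % 100 for r, p in percent_by_region.items()}
    leftover = total_rps - sum(allocation.values())
    for region in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        allocation[region] += 1
    return allocation


# 30% from the east and 70% from the west, as in the example above.
plan = split_by_region(1000, {"east": 30, "west": 70})
```

Here `plan` assigns 300 RPS to load generators in the east and 700 RPS to those in the west.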

One of the tips from here is, you want to know where your app is hosted.

And I’m sure that all of you will know that.

But one thing to keep in mind is if you are hosted in, let’s say, GCP, don’t do a simulation of traffic in the same cloud provider.

Because many cloud providers have something called internal routing.

So, you will not get the real picture of latency and things like that.

So, if you’re in the GCP, throw traffic from AWS.

If you’re in AWS, throw the traffic from GCP.

Data collection during stress testing

Data.

That was interesting.

Since we are going to run on production, we will create data in production.

We don’t want our business side to end up with reports like, “Hey, we had this many users in just one day.”

So, we worked with our team to make sure that we provide a way to identify the traffic that we are throwing, so that they can either scrape the data, or do whatever they want, but at least make sure that we give them a way to identify it.

And to do that, we provided a very easy solution: a Referer header. If you guys can try to read along with me, it’s a fun name, which is “October stress test to rock on election day.”

Yes, we really wanted this to be unique.
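A minimal sketch of the tagging idea: every synthetic request carries a unique Referer value, and downstream consumers drop or separate anything that carries it. The marker URL and the log record shape below are hypothetical; the exact header value used in the test is not reproduced here.

```python
# Hypothetical marker value; the real test used its own unique Referer string.
STRESS_TEST_REFERER = "https://example.com/stress-test-marker"


def tag_headers(headers):
    """Return a copy of the request headers with the stress test marker added."""
    return {**headers, "Referer": STRESS_TEST_REFERER}


def is_synthetic(record):
    """True if a log record (a dict with a 'referer' key) came from the test."""
    return record.get("referer") == STRESS_TEST_REFERER


def real_traffic(records):
    """Keep only the records that did not come from the stress test."""
    return [r for r in records if not is_synthetic(r)]


logs = [
    {"path": "/", "referer": "https://www.google.com/"},
    {"path": "/", "referer": STRESS_TEST_REFERER},
    {"path": "/api/user-info"},  # no referer recorded at all
]
clean = real_traffic(logs)
```

The value only needs to be something real users would never send, which is why a deliberately odd string works well.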

Along with that, we also made sure of very common stuff, which is…

We don’t want to hit anything with cost implications, like ads.

We don’t want to test other people’s servers.

We also don’t want to test anything which might be…

Like analytics, where it would produce fake numbers and stuff like that.

Monitoring your stress tests

Lastly, we’re in the monitoring conference, so we need to talk about that.

Monitoring plays an important role when you are conducting a test like that.

So, we created a bunch of dashboards, including one that gives us an indication of the overall system health at a glance.

Like, “What is our master dashboard that will show the health of all the stuff?”

Along with the system health, we also identified a bunch of metrics that we want to keep an eye on during test execution.

I’m talking about more on the stress testing part at this point.

As I mentioned, one of the key objectives of our stress testing was, we wanted to make sure our site is performant.

We didn’t just want to see if it’s throwing any 400 or 500 errors, we also wanted to keep an eye on the response times.

So, we watched the app metrics that we had identified, like the 95th percentile response time, and what happens to the 99th percentile when your stress traffic is above the normal peak traffic.
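For readers who have not computed these metrics by hand, here is a small sketch of a 95th and 99th percentile calculation using the nearest-rank method. Monitoring tools may use a different interpolation, and the sample latencies are made up.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it.

    pct is a whole number from 1 to 100.
    """
    if not samples:
        raise ValueError("no samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[rank - 1]


# Made-up response times in milliseconds: mostly fast, with a slow tail.
latencies_ms = [100] * 98 + [3000, 4000]

p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

In this made-up sample the 95th percentile is still 100 ms while the 99th is 3000 ms, which is why watching only one of them, or only error counts, can hide a slow tail.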

Executing the stress test

Next thing we’re gonna talk about is how did we execute the test.

And this day was one of my most nerve-wracking (as well as exciting) days at The New York Times.

Think of this day as like NASA launching some kind of rocket.

We had a war room with a representative from every team, as well as representatives from customer service and the newsroom, and they were waiting on us.

We were in command of running the test.

So, we had some logistics that, of course, we had to take care of, like making sure there was enough room.

We established a communication protocol: if you’re on application A, this is how you talk to us while we’re executing the test.

And, as I mentioned, we also had a bunch of stakeholders in the room, just curious about what happens if the site goes down.

I had never experienced a day like that.

Second.

This was the fun part.

It would have been impossible for one person to run the test at such a large scale and keep an eye on all the reports.

So, we had multiple executors.

And each test executor was given responsibility for particular tests or applications.

They were solely responsible for updating the stakeholders on what was actually happening in that particular test.

Second, as I mentioned, JMeter is not a browser, so that adds another complexity.

When you have multiple test executors running tests at the same time, running them in sync is very important.

Because in real life, if you go to, like, your homepage, your back-end API might be called three times.

But if you’re not in sync here, then you may be calling your homepage one time, but your data API 30 times.

So, that doesn’t reflect the actual traffic.

So, we wanted to make sure everyone is synchronous, and we created a bunch of ways of identifying, like, where they are at any given time.

Third point, incremental.

As I mentioned, we wanted to design the test to match the traffic that we see on election day.

To do that, we built the test in multiple steps.

Like one, two, three, four, five, six, seven, eight.

And at any given time, we wanted to make sure all the executors were on the same step.

And if the commander is giving, “Let’s go one more level up,” then we are going one level up.
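The combination of stepped load and synchronized executors can be sketched as one shared plan that every executor reads from. The eight steps and the one-to-three homepage-to-API ratio follow the examples in the talk; the multipliers, names, and base rate are hypothetical.

```python
# Hypothetical load multipliers for the eight steps of the test.
STEP_MULTIPLIERS = (1, 2, 3, 4, 5, 6, 7, 8)

# One homepage view triggers three data API calls, as in the example above.
CALLS_PER_PAGE_VIEW = {"homepage": 1, "data-api": 3}


def executor_targets(step, base_page_views_per_sec):
    """Return the target request rate for each executor at a given step.

    Every executor derives its rate from the same step number, so the
    ratio between the homepage and its dependent calls never drifts.
    """
    if not 1 <= step <= len(STEP_MULTIPLIERS):
        raise ValueError("unknown step")
    page_views = base_page_views_per_sec * STEP_MULTIPLIERS[step - 1]
    return {name: page_views * calls for name, calls in CALLS_PER_PAGE_VIEW.items()}


first = executor_targets(step=1, base_page_views_per_sec=100)
last = executor_targets(step=8, base_page_views_per_sec=100)
```

When the commander calls for one more level, each executor looks up the next step in the same table rather than picking its own number.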

I don’t know, it was fun.

Like, I don’t know how many of you will get chance, but it’s fun running the test to break your production site.

And we ran the test a total of two times.

Why?

Because we ran the test the first time and found a lot of issues, which we’ll go over in a moment.

We wanted to make sure that the teams resolved those issues, so we ran the test a second time to confirm that they had, and that we were fully prepared for D-Day.

You guys will be wondering, like, “You guys worked so hard designing and preparing.

Did anything meaningful come out of it?”

Of course, it did.

As you can see from this graph, which represents the system health of one of our applications during the testing,

everything was going well for the first 30 to 40 minutes, people were chilling, talking.

And all of a sudden, there is a wall of 500 errors.

Not 10%, not 20%, but 100% 500 errors.

Imagine this graph on an actual election results day.

I’m sure people will not be happy.

We were so glad that we were able to find things like this on a stress test day, and not on election results day.

Key findings

What are the key findings apart from that?

We identified a bunch of bottlenecks.

We are all in the cloud, and one of the things we found out is our autoscaling rules were not optimal.

They were not configured to handle the traffic peaks that we saw happen.

So, we tweaked a lot of our autoscaling rules.

Along with that, we discovered something called cloud quotas.

How many of you in the room are aware of cloud quotas?

Nice.

I think, by the end of this session, hopefully, everyone will go and learn about them, because in the graph that you saw before, we had exhausted our cloud quotas, and it took us 30 minutes to identify what was actually going on.

So, we came to understand the importance of cloud quotas.
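One way to act on this lesson is to compare projected peak usage against quota limits before the event. The sketch below does that with plain numbers. The resource names, limits, and the 80% threshold are all assumptions, and it makes no calls to any cloud provider API.

```python
def quota_risks(quotas, projected_peak, max_utilization=0.8):
    """Return resources whose projected peak exceeds a share of their quota.

    quotas and projected_peak map a resource name to a count. The result
    maps each risky resource to its projected utilization (1.0 means the
    quota would be fully used).
    """
    risks = {}
    for resource, limit in quotas.items():
        needed = projected_peak.get(resource, 0)
        if needed > limit * max_utilization:
            risks[resource] = needed / limit
    return risks


# Hypothetical limits and projections for one project.
quotas = {"cpus": 1000, "in_use_ip_addresses": 200}
projected_peak = {"cpus": 600, "in_use_ip_addresses": 190}

risks = quota_risks(quotas, projected_peak)
```

With these numbers, CPU usage is comfortably inside its quota while the address quota would be about 95% used, so autoscaling could stall well before CPU becomes the limit.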

Also, we surfaced a lot of tech debt.

We all have that issue where, “Okay, we’ll do it tomorrow.

It’s not a priority.

We’ll get to it next year, next quarter.”

And we found out that a lot of the tech debt that we had been shoving under the table for a long time started hurting us.

So, all of a sudden it became a priority for us.

Along with that, we saw a bunch of latency issues.

Yes, the site was not returning, like, 400 or 500 errors, but the response time was so high that it was almost as bad for us.

Second, we came away with improved observability.

Here I’m not only talking about technical observability, but also team observability.

In an organization like The New York Times, where you have multiple teams, we found out how one team depends on another team and what the relationship between them is.

So, we had a lot of teams who learned about each other, and then they were like, “Oh, my API is being consumed like this by this client.”

So, they learned a lot of stuff like that.

We also identified the key metrics and SLIs that we want to keep an eye on during an actual election results day.

Because, again, this was an exercise that we wanted to learn from.

It gives us a way to determine which monitoring dashboards we need on that particular day.

And it also helps on other logistics.

For example, which team needs to sit next to which team, what cloud provider we need to provide for each team, because they have to work together if any issues happen.

We also exercised a resiliency plan.

As you can see, we had an issue, but that gave us a chance to basically exercise the resiliency plan.

We had an outage, so right away the people on call that day went into action and exercised the process that we had set up.

Once we had identified the bottlenecks, once we had learned the lessons, once we had all of that, we had a level of confidence that we had never had before.

We had a new cloud configuration, everyone was a little skeptical.

But once we ran the test and once we identified all the issues that we need to resolve, we had an increased confidence, like, “Okay. We are ready for it.”

Election Day

But are we actually ready for it?

Let’s see what happened on the midterm elections day.

As projected, we had record traffic for the midterms.

As one of our service graphs over here shows, we received up to 40X sustained traffic that night.

That was beyond our projections for that night.

Along with that, since we were able to serve users stable, performant, and up-to-date key results, the users were happy with us, and they gave us business.

So, we saw significant registration growth on that particular day, that helped our product managers to understand that, “Okay. Performance has something to do with business growth.”

Third, as I mentioned, everyone will have an outage.

Even though we tried, there was an outage on our website too.

But the time that it took to resolve that outage was a matter of minutes, because the team had already gone through the process when the outage happened on the stress test day.

So, they already got the rehearsal on what to do.

That helped us a lot on that day.

This was the very first time that The New York Times conducted an exercise like this with the technology teams as well as the newsroom.

And it not only helped us prepare for the midterm elections, but it also helped us understand and learn how to design future systems in a more reliable and scalable way.

So, we have more and more demand that, “Oh, we should do this more and more often and not just on Election Day.”

More important than ever, elections are coming next year, and now we feel we are better prepared than ever because it’s the same configuration, we learned the lessons, and we just have to repeat ourselves.

The things that we did for the 2018 midterm elections were incredibly helpful.

And we are assuming this will be an even bigger traffic spike than we have ever seen.

You guys will be asking, “Was it all worth it?

You did all this stress testing.

Your teams might be doing this and not doing any feature development for months.”

I will say yes.

Because not only were we able to satisfy the users by giving them the best experience, our newsroom worked hard day and night to get the reporting on the ground, and it’s our duty as technologists to give them a space and a stage to represent their voice.

So, we’re really happy that their work was actually displayed, and that the election results were accurately shown to all of you guys.

And with that, I’d like to thank you all for coming to this session, and for successfully suffering through the food coma.