Building a Culture of Continuous Improvement at NASA (Paul Hill) | Datadog

Published: July 12, 2018

Lessons learned at Mission Control


Ladies and gentlemen, please welcome Paul Hill to the stage.

Paul: Thank you, Alexis.

Thanks for all of you being here.

I think I fit under the category of now for something completely different.

Now, with a picture like this you might expect I’m gonna talk about something technical and rocket-sciencey, and I’m not.

I am going to talk about Mission Control.

The same Mission Control incidentally where we managed to plan every mission, train every astronaut, fly every mission that NASA has ever put astronauts into space with.

Where I actually spent 25 years.

And what I wanna share with you isn’t really technical stuff, it’s more cultural.

And it’s some things that are less obvious kind of behind the scenes that are the reason why the Mission Control organization at NASA has remained a continuously learning, continuously improving organization and how that continuous learning is wrapped right back to how they have been able to perform at the highly reliable levels they are demanded to perform at to do such a difficult job.

Learning, not failing

Now, the reason we are so focused on this continuous learning, of course, is because failure to us can be huge and can be catastrophic.

But this notion of how we continue to learn, our lessons learned process, or as this industry may call them, postmortems, isn’t restricted just to our failures; it’s also how we dissect our successes.

But to give you an understanding of what failure means to us, I have two quotes that I’ll review with you.

One of them from this fellow, Thomas Edison, who after he was famous for doing a number of things, including inventing the light bulb, he said, “I have not failed 700 times, I’ve succeeded in proving 700 ways how not to build a light bulb.”

Now, the Thomas Edison that said this wasn’t the one that you see in the picture here.

He wasn’t the guy who was still trying to make a name for himself, still working to put food on the table, still trying to become wealthy.

It was this Thomas Edison, who years later after having already been credited with all kinds of inventions, had already become a wealthy man.

In today’s dollars worth about $200 million.

Whose closest friend was none other than Henry Ford, one of the single richest men that’s ever lived in history.

It’s also after J.P. Morgan had already funded transitioning his Menlo Park lab into the first ever industrial research lab.

You could make a really good case that this Thomas Edison could afford to fail 700 times.

In fact, sometimes when you see this quote it says 10,000, which even better makes the case.

So, you know, one of those notions that has come in vogue in the last 10 years is, “We should fail, we should fail often, we shouldn’t be afraid to fail because eventually we’re gonna find that nugget, we’re gonna find that light bulb.”

Well, Edison, Ford, J.P. Morgan, they could afford to fail 10,000 or even just 700 times.

Now, to give you another perspective on what that failure could look like, here’s one from our experience.

The guy you see in this picture is Gene Kranz, one of the very first Flight Directors ever.

Flight Director that’s featured in the movie Apollo 13, played by Ed Harris, you know, the fellow with the white vest.

The quote that is attributed to him is, “Failure is not an option.”

He said so in the movie. Now, in real life, Gene Kranz didn’t really say that.

However, it accurately reflects his attitude, both then and today.

It accurately reflects the attitude that he imbued into the organization as he left his fingerprints all over the way Mission Control does their job, even today.

And it makes sense that failure is not an option.

In our business, when we’re managing energies like this, where it takes millions of pounds of fire to throw things into the sky, and anything we put in orbit has to travel over 17,000 miles an hour.

And when it gets there, all that energy from all that fire is still in the system all the way until we bring them back down to the ground and they’ve come to rest.

And while we’re doing it, those are friends of ours, sitting in that spacecraft on top of the rocket.

So we’re not just putting something up in this space.

There is no, “You have to crack a few eggs to make this work.”

Every time we show up to do this job, we have to be perfect in our decision making, we have to be perfect in our outcome, we can’t go into it accepting that failure is an option.

Lessons learned

So with those two quotes, those two ideas, as sort of the bookends, I do wanna share with you some about how we go about this lessons learned process, or what you call postmortems.

First, a little context for life inside the Mission Control Room.

This picture is inside the space shuttle mission control room.

Not quite two years after the Columbia accident, as we were first starting to train again to get ready to fly after that accident.

In fact in this case, we were practicing a rendezvous where we fly a shuttle up to and then dock to the International Space Station.

Everybody that you see in the room there is some expert on some part of the spacecraft or some technical discipline.

There’s an electrical expert, a computer expert, there’s a trajectory analysis expert, there’s somebody that’s all about the robot arm and the spacesuit, and pretty much any way you can dissect the spacecraft or the engineering disciplines there’s an expert in there.

Many of them, in fact, have other people in other rooms that are whispering other advice, helping them analyze the data for their contribution.

Each of them is then responsible to make the recommendation to the Flight Director, the white vest guy, which thankfully, in my day, we weren’t wearing the white vest anymore.

And the Flight Director is then held accountable to be responsible for everything that happens in the flight, to make sure that we always protect those astronauts, bring them back to the ground and get the mission completed.

So on this particular day, it was probably about a day-long simulation.

And our simulations in the room could last anywhere from four hours to maybe 10 or 12 hours, sometimes we stretched days together, and we had three teams come in just like we do during a flight.

We have astronauts over in the simulator, the real mission controllers sitting in the real mission control.

And then a team of instructors sitting upstairs sort of playing God, breaking the spacecraft, making it hard for us to do what we do.

At the end of this, every time we do it, no matter what level of training it is or what we’re preparing for, then we pull the team together and we debrief.

And this is where we start our lessons learned activities.

And what are we debriefing?

Everything that happened and we do all of it as a team.

And it’s not really rocket science.

So we take everybody who participated, anybody who made a decision, took an action or recommended an action during the operation, they are part of this discussion.

So the astronauts in the simulator are part of this conversation, the instructors are part of the conversation, certainly everybody that contributed, everybody that was part of delivering the product, if you will, participates in the discussion.

Oftentimes, it’s not any more complicated than the Flight Director just going around the room, starting in the front, “What did you see and what do we wanna talk about?

What went well?

Well, you know, we saw these events, we had these failures, we responded by the procedure right out of the book, we really didn’t learn anything but we did validate.

The procedures are in good shape, the plan is in good shape, the software is all still working great.

We’re ready to fly."

Just as importantly, maybe even more importantly, we wanna talk about, “Well, what didn’t go well and why didn’t it go well?”

This is where we would hear flight controllers say, “Well, I missed a call.

It’s my fault.

Here’s how it happened.

We had a number of failures and we were chasing those failures and while we were chasing those failures in how our system affected the other systems and vice versa, we missed the fact that one of the computers had a card go down and some of the data we were basing our decision making was no longer real-time data.

And we got fooled, the data kind of lied to us but it was our fault, not the data’s, we should have seen that and we could have had backup cues."

And you hear a call like that during these debriefs, while everybody talks about, “Here’s what I could have done better.

Here’s the mistake that I made.

I got two digits inverted on a command and completely screwed it up.

Here’s how that happened.

Here’s what I should have done that would have prevented it from happening."

Even as the leader, it was more than once, I would say it wasn’t unusual but it was more than once, at least, that as a Flight Director I would start the debriefing with, “Okay, this one was rough folks.

This one’s on me.

I know I gave this direction early on and set us off down the wrong path.

As we go around the room I wanna hear from any of you that that contributed to problems that you saw and how we proceeded and could have gotten in our way and kept us from being successful."

Now, the other thing we talk about is where did we get lucky.

So something happened that we didn’t respond to and do the right thing but it didn’t matter, we docked or whatever our mission was, we got it done.

We wanna talk about those things too.

That might be something like, “We got in close and one of our backup rendezvous sensors,” you know, the things like a radar or a laser that we use to measure how close we are to the space station, how fast we’re approaching the space station.

One of the backups goes down.

We didn’t see it.

It ended up not being a big deal because the primary sensor kept working.

No harm done.

What would I expect?

The system experts whose job it is to manage those systems to tell us in the debriefing, “Hey, we got lucky.

We had a backup go down, here’s how we missed it and we shouldn’t have.

If we had had the primary go down when we were in close, we would have had to abort the docking or we likely would have bounced off, which by the way is not good when you’re trying to dock."

And then when we finish those discussions, what do we still need to work on?

What procedures do we need to change, what part of the plan, what part of the software, which parts of maybe just knocking the rust off on how we are working together as a team, but which things got in our way, what do we need to work on?

And based on the answers to all those questions, who now has the onus on the team to pull together that response because we don’t wanna just make the observation and leave it out there.

We need some one of those experts or multiples of them to go put their heads together and then come back and tell us, “Here’s how we’re now gonna respond and fix that.

So the next time we train, the next time we fly we don’t see that problem again."

Throughout all of it, the focus is on the team being better the next time we show up.

The focus is on us being more successful at what we are delivering in the product, in our case, protecting the astronauts, protecting the spacecraft and then getting the mission accomplished.

But it’s all about getting better.

It isn’t focused on how do we find the Nimrod that made that mistake and punish that person, nail them to the wall so everyone knows never make mistakes.

It’s all about making sure that we get everything out there so that we learn as a team and keep getting better.

Mission Control’s four keys

There’s four keys to doing this well that we have actually turned into a science, if not an art form, in Mission Control.

All in

The first one is we’re all in, the whole team participates.

Every time we debrief, anything we do, everybody that participated in the operation has to be part of the discussion, has to be part of the lessons learned.


There’s also strong alignment to what we exist as a team to do, so that that primary purpose of protecting astronauts and the spacecraft, and then accomplishing the mission supersedes everything else.

It’s more than just how good each one of us is at what we do because no matter how good I am at what I do, if somebody else makes a fatal decision, then we can’t succeed.

We hurt the customer, in this case we hurt the customer in a real way.


Part of this also means we have to be willing to pass judgment and we have to be willing to accept judgment.

You know, as a leader I have to be willing to say, “Wow, of all the simulations I’ve ever led, this one was terrible.

We have to be better than this.

This isn’t up to any of our standards, whether it was my fault or other parts of the team’s.

Let’s talk about those things, and by the way, let’s not shy away from accepting, yeah, this one’s on me. Here’s how I made this decision. Here’s how maybe as the leader, I intimidated some flight controller into telling me what I wanted to hear."


And then lastly, all cards have to be on the table.

We have to be fully transparent.

Otherwise, we risk not actually learning the things that we needed to learn.

In fact, we risk something we call negative training, where we reinforce that something must have been okay because we didn’t talk about it, when in fact we just didn’t get it on the table when we should’ve talked about it.

And then lastly, we report out.

Now, honestly when we do these training runs, reporting out usually doesn’t mean anything more than we tell each other and maybe we tell the other people who are gonna work this mission with us.

Every now and then we’ll see something in training that raises a flag and we’ll say, “Hey, we haven’t made this kind of mistake in a while.

This is not a good indicator.

Let’s get that word out to our management team.

Let’s alert the rest of our ops community to be concerned about this."

In flight

So this takes us to real flight operations.

This happens to be a picture in the back of that same room, the space shuttle flight control room.

This one is during the actual flight.

First time we flew after the Columbia accident, about 23 minutes before we dock.

So this coincidentally, I picked out a picture that was from the same timeframe as that picture before, which by the way, is maybe not as cool to any of you as it is to me.

Now, we have a mantra in the Mission Control world that is, “Train the way you fly, fly the way you train.”

What that means to us then is, of course, the way we do business in the control center when we’re actually flying, is the same as when we’re simulating, including this debrief idea.

Now, we don’t debrief ourselves at the end of every shift, but what we do is when the next team comes in, they now have logs that each of us has taken throughout our shift.

What’s been going well, what hasn’t gone well, what did we get behind on, what did we break.

And what does the team that’s coming on next do?

They all review those individually and then they go around and debrief their Flight Director before they come on board and replace us.

And eventually they will talk about what went well, what didn’t go well and what are these guys leaving for us to fix that they didn’t fix for us.

In fact, we kinda talk about it like that.

And for those of us that are leaving, rolls right off our back because that’s what we need them to do because we’re all in.

And we know that we broke this, we told them that we broke this and we need them to fix it so when we show up tomorrow we can just hit the ground running and make sure that we’re gonna succeed in this mission.

In retrospect

Now, when the mission is over, we pull everybody into a conference room.

Everybody who worked the flight, not just on this shift but on all shifts.

And we do the same thing that we did in simulations, we debrief.

We wanna know lessons learned.

And the process? Identical to how we do it after our training runs.

So we have everybody in the room, we wanna talk about the good, the bad, the ugly, why didn’t some things go well, what did we learn from it, how are we gonna be better next time.

And the same keys apply.

By the way, if you look at these keys to doing this well, you can also see the cues for, what does it take to not do this well, what does it take to ensure that you have negative training from these conversations?

One of them is, not everybody participates.

Maybe the veterans, the old dogs that know they don’t have anything to learn, they go on home.

They don’t have to participate, which means that the newer folks, maybe some of the folks that stumble, don’t have the benefit of their guidance, don’t have the benefit of their experience.

Plus, more times than not, some of our old hands, some of the veterans, as the discussion goes on, will hear things that they realize, “Oh, I missed that, I think we made that mistake too.”

Or, “I learned something from that, even if we didn’t make that mistake.”

You can’t learn it if you’re not part of the discussion.

Everybody has to be in.

And rather than being aligned to our common cause as a team, if the electrical expert shows up and all he cares about is, “I made all my calls exactly right.

It’s not my fault that the stupid computer guy wasn’t paying attention and didn’t realize that some of his computers failed, so that when he switched computers he switched to a computer that didn’t have electricity.

That’s on him.

I knew it was gonna happen, but that’s his problem, not mine."

And that gets in the way of the team.

Instead of a willingness to pass and accept judgment, think about what we all learn as we become managers.

What’s my job as a manager?

Circle the wagons around my people, protect them from…

Especially the upper management who are gonna be critical, wanna find somebody to blame.

Imagine what that means then whether it’s in one of our training runs or in an actual flight if, for example, we screwed something up, we made some mistake and we didn’t dock, for whatever the reason.

And yet my attitude as the leader is, “Hey, my guys are all good.

We tried the best that we could. It’s no harm done."

Well, no harm done other than we wasted a billion dollars.

We might have put those astronauts at physical risk and now a bunch of them have to fly again and we didn’t leave people on space station like we were supposed to.

We don’t get to the learning if we’re not willing to pass and accept the judgment.

And the fact that we are strongly aligned to the same objectives, helps us do it and not be as concerned about getting blame because it’s not about blame, it’s about being better.

And then full transparency.

You know, this one is tied to that willingness to talk about things that maybe we got lucky with.

I had a flight controller once tell me that as he was coming up, he had an experienced veteran, one of the old dogs, who was watching him while he was training, chewed him out after one of their training debriefs.


Because he brought something up.

“Hey, I screwed this up, Flight.

Nobody caught it.

Here’s how it happened.

It won’t happen again.

I now understand this better."

His old mentor pulls him aside and says, “You made yourself look weak.

You embarrassed our entire group in front of the whole team and in front of the Flight Director.

That’s not what we do."

I would tell you, as someone who was a Flight Director for nine years and then was the Director for years after that, that’s the kind of person I wanna find, and excuse from being part of the team because it’s absolutely part of our culture to put all those things on the table so that we all learn and there’s no negative training.

And the last part of this lessons learned discussion after a real flight is, we report out and up, and it’s all full and open.

And when I say out and up, I don’t mean just to the people who work the flight or the other operators, I mean our entire operations community and our entire management chain all the way up.

And in fact, we’ll go to our customer management chain, for us our customers are also NASA people.

So we’re not crossing company boundaries, although, that being said, our lessons learned also get distributed out to all of the contractors that support NASA.

Now, when we do these lessons learned, we do it in this conference room, which I admit is kind of a fuzzy photograph but it reminds me of a question that I was asked a couple of weeks ago about these lessons learned discussions, and in particular, who is allowed to attend or who is allowed to participate?

So I have two things to point out from this picture.

One is, anybody is allowed to attend, anybody in our community can just show up.

One other thing I should point out is, this looks like we only have a handful of people. It’s unfortunate positioning of the camera; you can see less than a fourth of the people that are there.

I say that just as a reminder that everybody that worked the flight is in that room somewhere, you just can’t see most of those folks.

But there’s a lot of other people because anybody in our community is allowed to be there.

The people who are required to be there, are all of those people who participated in the team and were part of the operation.

And the other thing I would point out from the picture is you can tell that I am turned away from the table and I’m no longer talking to the leads or the big dogs who pulled the mission together, I’m now talking to somebody in the peanut gallery who has… We call it the peanut gallery by the way, who has either asked a question or made some point.

And we’re now talking about, what does that mean to us?

Whether they were to say, “Hey, I think you guys are being too easy or you didn’t talk about something” or maybe they’re just answering a technical issue.

Anybody can attend, anybody can participate but everybody that contributed has to be there to participate.

When we’re finished, then the leader puts all that together in a report, and as I said, they send it up and out.

Then we also convene the senior managers where the leader goes through here are all those top-level lessons learned, the good, the bad and the ugly, in conference rooms that look like this one.

And again, just because of the camera angle, you can see only a fraction of the people.

And the reason I tell you that is, again, many of those people that participate on the ops team are scattered throughout the audience, even when we’re talking to the upper management.

Because again, anybody can attend, anybody can participate.

And as the managers start asking hard questions, it’s not unusual for the leader to sit back and say, “Well, you know, my life support expert is over here.

Let her come up here and answer this because she’s way smarter than I am on this.

Maybe we’ll both learn something."

Which brings me to my last example of postmortems, or our lessons learned focus.

In this case an actual postmortem.

So this is me and a fellow named Doug White as we were each testifying at the presidentially appointed Accident Investigation Board after the Columbia accident.

Both talking about the current status of different teams he and I were leading as part of NASA’s investigation.

Now, after an accident like this, of course, we do the same thing.

There we go.

I thought I was having a failure there.

We review our lessons, we review everything that we did for lessons learned as a team.

And by the way, this isn’t just when we have catastrophic accidents or catastrophic failures where rockets blow up and people die, it’s when we have some significant event that is so outside of our experience base.

So we had damage, we had significant loss of money, we had risk to our people, something that is not normal, even for a business where we frequently will have components fail or things like that.

Any of those things we treat just like this, but we also approach them the same way we do those debriefings in the lessons learned discussions after a simulation or after a flight that landed successfully.

And that is, we wanna know what happened, how did it happen and how do we keep that from happening again.

And specifically, we want to know the root cause.

We don’t just wanna know where we had this computer fail and that caused everything else to come apart.

Well, what caused it?

Did we have a manufacturing problem, did we have an installation problem, a software problem, did the astronauts throw the wrong switch, did the mission control send the wrong command?

What caused that ultimate problem and what caused that, how did we not catch that?

Did we have a process escape?

So was there some quality process that was intended to catch whatever this root cause was and didn’t?

How did that happen?

Or even worse, our quality process did catch it, we have paper that shows it and we didn’t take the corrective action.

How the hell did that happen?

And then, what could we do, what should we do next time, what could we have done this time that would have led to a better outcome, that would have been less catastrophic or a lower or less severe kind of failure?

And then, what do we need to change?

Throughout all of this, again, it’s all about lessons learned, it’s all about how we keep this from happening again, but not just this particular failure.

We call that fighting the last battle.

And anybody that fights the same battle over and over can win it every time they fight it after that first time.

What we wanna know is, based on what we learn from this occasion, the lessons learned today, how can we apply that to other similar risks and get out in front of those and do a better job mitigating those risks and making sure we’re successful, if we’re presented with those problems?

And the keys to doing this well, even after a catastrophic failure, are exactly the same.

Everyone’s in, everyone’s aligned to what we exist to do, everybody’s willing to pass and accept judgment, and we have to have full transparency.

And again, when we finish we report out and we report up and it has to be full and open.

One thing that’s interesting about the up part, sometimes you’re not actually reporting up, after a significant failure, depending on how high the failure is, we’ll get somebody from outside of the operations community, maybe even all the way outside of NASA to lead the investigation because it artificially enhances that transparency, that sense that, “We aren’t hiding anything, we’re not being defensive, we are gonna get to the bottom of this.

We are going to learn and be better the next time."

Review to improve

So, today’s Mission Control sits at the end of almost 60 years’ worth of incredible performance and they are as good today as they ever were on their best day.

I don’t care when that was, shuttle timeframe, Apollo timeframe, the folks that are there are miraculous at what they do.

A significant, and as I said when I started, a less obvious part of why they’re so good at what they do is this focus on lessons learned, or as you call them postmortems.

And that focus is part of the culture, and it’s an attitude that can be applied in any industry, including yours, and like so much of our culture, it ain’t rocket science.

It just means you have to be willing, and not just willing, you actually have to engage and review every operation, every delivery, the good ones, the ones you stumble on and the bad ones, the failures.

And when you do, keep the focus on making the team better, making sure you’re more successful next time than you were this time.

And by the way, you know, sometimes do you have somebody that in fact things point back to and say, “We now have a problem with this person?” Maybe.

Usually that involves somebody who wasn’t aligned and wasn’t doing what needed to be done, not somebody who made a mistake.

More than 99 times out of 100, there’s some part of how we are doing things that we can now tighten up and we can be better as a team.

And the keys to doing that in any industry are these same four things.

And while these are attitudes about how you do postmortems, how you do lessons learned discussions, just like with us, this can become a key part of your culture that, just like with us, can help your team continue to be more and more successful just behind the scenes, no matter what your rocket science is or what your technical discipline is.

So, with that, I hope this helps and I thank you for listening.