This is a transcript of episode 331 of the Troubleshooting Agile podcast with Jeffrey Fredrick and Douglas Squirrel.

You’ve never been in a plane crash because accident investigations are blameless, but do you know how to run a blame-free postmortem?

Listen to the episode on SoundCloud or Apple Podcasts.

How Are Airplanes So Safe?

Listen to this section at 00:14

Squirrel: Welcome back to Troubleshooting Agile. Hi there, Jeffrey.

Jeffrey: Hi Squirrel. Hey, so we had a funny thing happen last week, and I thought it would be great to discuss-

Squirrel: A funny thing happened on the way to the podcast. Sounds great. What was it?

Jeffrey: Exactly. So the funny thing, and okay, maybe it’s not that funny, but it comes in two parts. And here’s the story. You and I, years ago, over a decade ago, like in December 2011, had a discussion about how managers could learn management theory. And we came up with the idea of a monthly discussion group.

Squirrel: I think you’re giving me way too much credit. I think you brought it to the company that I hired you into, and I was eternally grateful.

Jeffrey: Hahaha so since that point in time, you were very encouraging. Since December 2011, I’ve been part of a monthly discussion of management theory, with different material to discuss each month. And the one on Friday was one that made me think of you and of this podcast. And it has to do with an article we’ll link in the show notes. It was called Why You’ve Never Been in a Plane Crash. And it talked about the history of aviation safety in the US and the way that the NTSB, the National Traffic Safety Board, I think that’s the right National-

Squirrel: Transportation.

Jeffrey: -Board. Yeah, yeah, yeah. Thank you. They do investigations into crashes to make sure that they understand what went wrong, and how to fix them and prevent those crashes from ever happening again.

Squirrel: And not only crashes. It’s really important. Not only crashes, but near misses. So things that were almost crashes get reported to them and they say, ‘ah, so you managed to dive at the last second, but how on earth did you get that close to the helicopter in the first place?’ That’s the kind of question that they ask so that it doesn’t happen next time.

Jeffrey: Yeah, what you just got to is actually really key, that a lot of what they investigate is self-reported from people saying, ‘hey, I was almost in an accident.’ And they’re able to get people to self-report because they have this idea of a blameless postmortem, where the accident and near-accident investigations that they do are not legally admissible in court.

Jeffrey: So in cases where there has been an accident, where someone has made a mistake, even in cases, and this is one of the cases they talk about in the article, where people have literally died as a result of someone making a mistake, they’re not going to go and punish that person. Instead they say exactly what you said: ‘How did we get in a position where a mistake by a human, which we should expect, could result in people dying?’

Squirrel: And there’s a great example of that, one that listeners have probably seen misreported, if you’ve ever seen the movie Sully, S-U-L-L-Y, which is about The Miracle on the Hudson, where a pilot called Sullenberger, and a bunch of other people, landed an airplane on the Hudson without killing anybody. And what happens after the event (actually, in the movie, you see the investigation before the event) is that the investigation is very oppositional. The investigation is about how Sullenberger screwed up, and Sully is in trouble, and he’s bad. Why didn’t he figure this out sooner and go land at a normal airport instead of putting himself down in the Hudson?

Squirrel: Nothing like that actually happened. It’s done like that for drama in the movie. But in fact, what happened was a bunch of people sat down together and said, ‘now what could we have done that would have changed this?’ And what they concluded was that Sully did everything he should have, that it wouldn’t have been possible for him to go to a normal airport, that he needed to be on the Hudson, and that they needed to train people in how to do that and do other things to deal with it. So it’s a perfect example of what doesn’t happen, but what Hollywood would like you to think happens.

Jeffrey: Yeah, exactly. And so that’s what the article is really about: they have these non-adversarial postmortems. Like, let’s get to the bottom of this, because we all care about making safety better. And in places that don’t have that, the article uses an example from Canada where, at the time, the judge investigating an incident could recommend people be prosecuted. They talk through a case where it looked like the person involved in the accident had every incentive to lie and mislead about any culpability on their part, because of course they don’t want to be prosecuted.

Jeffrey: So it was making this contrast. And people who are in software have probably heard about this thing called a blameless postmortem, and that’s what this is about: saying that the way we learn is not by trying to blame humans, but instead by figuring out what’s gone wrong in the system. And in our discussion on Friday, we talked about this in terms of safety culture; we tied it into the concept of psychological safety; we brought up the example from Amy Edmondson, where she was looking at nursing teams, and she expected to find that nursing teams with the highest error rates also had the worst patient outcomes. But it turned out to be the opposite, because the key thing was that the error rates were self-reported, and it was the teams that had enough psychological safety to say, ‘hey, I made a mistake,’ that could learn from it, fix the system, and have the best patient outcomes. So we were tying all this together. Now this by itself, I think, would be enough for us to bring up on the podcast. There are a lot of interesting things to learn from this, but that’s not actually why I wanted to discuss it.

Root Cause Analysis Starts with Client Impact

Listen to this section at 06:00

Jeffrey: The reason I thought it was worth discussing, and particularly wanted to discuss it with you, Squirrel, is because both before and after this conversation, I was involved in root cause analysis of incidents within the company, and we had people who were running these incident analyses, running these RCAs, and in my view they didn’t do a great job of it. They didn’t really know how to run root cause analysis. They were willing to, but they didn’t really have the right mindset. And one of the things I remember from our time working together is that you had very definite views about how RCAs should be run, and I thought it would be great to talk about your process for RCAs; that’d be so good for our listeners. So what do you think? Are you willing to do that?

Squirrel: I’m totally willing and completely forgetful. So I remember being very opinionated about it. I remember having rules and I don’t remember what any of them were because it’s been over a decade.

Jeffrey: The thing is, I’m pretty sure, and I know this because I checked, that you actually have this on your website. So we can say, for people who want to know Squirrel’s rules for this, you can find a video recording and slides on Squirrel’s website, on the page that says ‘How I Work’.

Squirrel: Oh, great. We’ll put that in the show notes, and then I’ll watch it because I’ve completely forgotten it.

Jeffrey: I do want to bring in some of your rules, and you can maybe talk about what you remember of them, because I have a feeling when I say them, it might spark some memories for you.

Squirrel: I have this feeling I’ll remember, but I don’t right now.

Jeffrey: Okay. So number one. Maybe I’ll just mention first what I mean by an incident, and what we’re talking about with a root cause. We’re talking about something that’s gone wrong in production, right? So this is not something in development. It’s not something in your test environment, at least not in the case that you and I were discussing. Something has gone wrong in production, some technical thing has happened, and something’s broken. And you want to ask, ‘well, how did this happen in production?’

Jeffrey: The key thing for you is that you said, ‘the root cause analysis meeting that we have is not going to start with talking about that technical thing.’ The first thing you said we’re going to start with is talking about client impact. Do you remember that rule?

Squirrel: I do now that you remind me. And that reminds me of a great case where I used it, where an e-commerce company that I was working with at the time had an outage on something called Black Friday, which should send shivers down the spine of anyone who’s ever worked in any form of commerce. Because Black Friday is when they would make half their income for the entire year on a single day.

Jeffrey: Right!

Squirrel: And so something went down. If I remember it correctly, there was something wrong with one individual product, and it was misconfigured or wasn’t set up properly. And when we had that product on the home page, the home page wouldn’t come up. When we took that product off, the home page would work. Those of you who have worked with computers know what this is like and how frustrating it is. We didn’t start with that. We started with how much money the company lost by not having anything up on the home page for a long time until we figured that out, and with the effect that had on customer service and elsewhere. Then we did the analysis of what that meant and what we could do about it.

Jeffrey: Right. And that’s a great example. Long time listeners will remember us talking about that incident before, not in terms of the RCA, but rather the relationship you had with the product manager, because before you actually spoke to them, they were really angry. They thought you had just sort of taken their product off the page and been like, ‘oh, well, it doesn’t work. No worries about it.’ They didn’t understand the whole behind-the-scenes timeline of trying to make things work.

Squirrel: Which is why starting with the customer impact was so important, because it made it very clear that what we were after, in that case, was reducing that customer impact. We wanted to make sure somebody could sell something, and we were certainly trying very hard to sell every product. It’s just that we had done something wrong with this one. We weren’t going to figure that out on the day.

Jeffrey: Yep. And now the question here that people might be asking is, ‘why do we care about customer impact? Don’t we care about just solving the problem?’ And the answer is because our technology is part of a sociotechnical system that our company provides. In other words, it’s not just the technology; there are also people in, say, support, or customer service, account managers, and the clients who are having trouble with our website often call them and say, ‘hey, your site’s broken, what’s going on?’ And those people, who are part of the system, their response to the clients might be, ‘yes, we know about it. We’re aware of it. It’s going to have this impact. We’ll let you know what happens.’ In other words, they need to be in the loop. And we’re only going to be able to look at that whole sociotechnical system if we start from the client point of view, not the inner technical view.

All Action Items Should Be Completed in a Week

Listen to this section at 11:22

Jeffrey: So that’s number one: understanding the client impact is our starting point. On the other end of your rules, the other rule that I remember, which was very interesting, is about the outcome of the RCA. Of course we want to fix the immediate problem, but we’re trying to understand how we ended up here: what were the things that led us to this, and what could we put in place to prevent it from happening again? You’d have these RCA action items, and your rule was that we should complete all the action items within a week.

Jeffrey: In other words, this has two constraints. One is we can’t imagine, ‘oh, we should take six months and rewrite the system.’ That wasn’t a valid thing to put down as your action. And the second was that whatever we said was important, that we wanted to address, had to be small enough that you could deliver it quickly, and then you needed to actually deliver it quickly! So you actually needed to get stuff out to fix it, not just talk about it, not just add it to the backlog of possible future enhancements.

Squirrel: And in fact, we had a whole process for checking that those actions were completed. Now, something that listeners might be bothered by is they might say, ‘well, wait a minute, maybe I really do need a month to fix something that’s really heinous and horrible, that makes a product not work on the home page, or that deletes patient records or something like that. Squirrel, it might take more than a week to actually finish that work, especially since we already have other tasks to do. And, you know, you’re adding this new thing. What do you expect us to do?’

Squirrel: I don’t expect the whole amelioration, the mitigation of the risk, the fix for the problem, to be completed in a week. But I expect the action from the root cause analysis to be finished in a week. So that might mean that we add a feature to a backlog that we know we’re working through. So it might not be that we get to it until a sprint four weeks from now, but we know within the week it’s added, and it won’t be forgotten.

Squirrel: In another case, you might say that we’re going to institute some new mechanism, and we’re going to make sure that before each monthly report, we’re going to do a certain thing to check that the database is in a good state or something like that. Well, that might not happen for a month, so what do we do then? Well, what we do is we put it in the diary! So we make sure it’s in the calendar, we’ve invited all the people, we’ve written down what the agenda is. We know we’re going to have this meeting two days before the monthly report goes out. Then we can complete that task within a week and check on it, and we’ll be sure that when it comes time to actually get bitten by the bug again, we will have taken the mitigating action so that it doesn’t happen.

Before the RCA, the Timeline

Listen to this section at 14:13

Jeffrey: Yep. All right. So those are two rules. I have two more: one that I think you had as one of your rules, and a second that I added later. The first is that before the root cause analysis we should have a timeline of all the events that are relevant. Do you recall if that was one of your rules or one that I added? If it doesn’t sound familiar, it might have been mine.

Squirrel: I think it is mine. This squirrel guy, he’s really smart. I’m impressed with him. You should definitely go watch his video. I don’t remember any of this independently, but now that you remind me of it, yeah, it’s all coming back. We would walk in, and we would have a timeline already drawn out on the whiteboard before we had the discussion. That really helped us to remember, ‘oh yeah, actually, we tried three different products before we figured out it was this product that took down the home page.’

Jeffrey: Yep. And one thing I’ll say, and this kind of goes back to what we’re trying to capture when doing an RCA: I want to capture the whole system. So it’s not just the timeline of what we did once we knew there was an outage. I want to have everything that’s relevant, including what happened ahead of time.

Jeffrey: So, for example, in one of the cases that we recently looked at, there was a certain problem that would happen, a timeout, and there would be an error logged every time that event happened. Then it would retry ten times before it would stop. And what I wanted the timeline to include is not just the point at which it stopped, and therefore the process failed, but all ten of the preceding attempts! Those were opportunities where we could have potentially detected it; with those events in the timeline, we can see where we might have intervened and avoided the outage. So I want to have the upstream stuff, from the point where the symptoms first could have been visible.

Jeffrey: And I want to continue downstream through any client communication elements. So if there was follow-up with clients after things were resolved, or if, say, we had to go fix stuff in the database, right? All the things we had to do, not just the immediate parts around the outage. Because we’re trying to improve the whole system, we want to capture in the timeline all the contributions to and impacts on the system. So that’s the timeline.
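To make that concrete, here is a minimal sketch, in Python, of retry logic that logs every attempt rather than only the final failure; the process name, retry count, and delay are made up for illustration and aren’t taken from the incident Jeffrey describes.

    import logging
    import time

    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    log = logging.getLogger("nightly-export")  # hypothetical process name

    def fetch_with_retries(fetch, max_attempts=10, delay_seconds=5):
        """Call fetch(), retrying on timeout.

        Every attempt is logged, so each retry appears later as a
        timestamped event in the incident timeline, not just the
        final failure."""
        for attempt in range(1, max_attempts + 1):
            try:
                return fetch()
            except TimeoutError:
                # Each logged warning is a detection opportunity: an alert
                # on repeated warnings could fire well before the last retry.
                log.warning("attempt %d of %d timed out, retrying in %ds",
                            attempt, max_attempts, delay_seconds)
                time.sleep(delay_seconds)
        log.error("all %d attempts timed out, giving up", max_attempts)
        raise TimeoutError("upstream call failed after retries")

The point isn’t the retry mechanics themselves; it’s that because each attempt leaves a timestamped record, the timeline can include all ten tries, not only the moment the process finally failed.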

Squirrel: I do remember that one. Then you say you added something, which at this point everything sounds added to me. But remind me, what did you add?

Jeffrey: The one thing that we added: we’d come into the room, and what we found sometimes is that in our eagerness to have the RCA and start getting remediation in place, people hadn’t gotten completely to the bottom of exactly what happened technically. And so we added what we called the technical investigation and made this a distinct step, as a prerequisite to the RCA.

Jeffrey: Because what would sometimes happen is people would come into the RCAs, and they would say things like, ‘well, we THINK this is what happened. There was this message, and we THINK it was caused by this.’ And it was like, well, whether it was or was not is very material to what we do next. So we made it a prerequisite that we actually had a deep technical understanding of exactly what happened, to the point that we felt we could reproduce it, because we wanted this grounded in technical reality.

Jeffrey: Because the point about this RCA is we wanted to all have the same shared facts. That’s, A, the purpose of the timeline. And then, B, the technical investigation really makes sure we’re exactly grounded in what happened and what the constraints are, because sometimes I’ve gone to RCAs where people lacked this, and they were speculating about the cause. They’d have a remediation that got approved in the meeting, and they’d go do something. And then the same problem would happen again, because it turned out that actually wasn’t the root technical cause.

Jeffrey: Therefore we augmented the three elements that I think you’d had as your rules with this fourth one of a technical investigation. We found that by adding that in, we had a much more productive in-the-room conversation about what might have technically made a difference or not, rather than speculation. So I don’t know if that sounds familiar to you. I’m pretty sure that was a later addition.

Squirrel: It does sound like it, but certainly a very valuable one. So did you get anything else from this very clever video that somebody made?

Jeffrey: Hahaha well, I gotta say, I haven’t seen that video for a decade, because I had seen you do your training firsthand. And I remember when you went out and did the video and had it recorded, but I had the benefit of being in an interactive training session. Part of your training the company in doing RCAs was that, originally, you would personally facilitate all the RCAs so that they followed this format. And then, second, you had a process by which someone could learn how to do the RCAs: they had to do a training with you, then they would watch you facilitating an RCA, and then you would watch them, before they would be signed off as an independent RCA facilitator. And that’s why I have exposure to your process for doing it, because I took your training.

Squirrel: Fantastic. Well, I had a lot more time then, I think. That sounds like a more involved process than I would necessarily institute today, but I’m glad that this artifact remains. I’m now very keen to go watch it. And I hope that listeners will, too. I’m certainly going to put it up on my forum, The Squirrel Squadron forum, and invite discussion there, so we can delve a little bit more into how to prevent catastrophes before they happen.

Jeffrey: Yeah, absolutely. And I think it’s something that’s really fitting for our audience and our overall message about learning loops: the idea that this is one type of learning loop, and it’s not enough just to have the intent, but you might also need some actual practice with it. You need to get better at your learning processes. Learn how to learn, not just assume that it’s a good idea and that therefore you know how to do it well.

Squirrel: Sounds like a good plan. Well, if listeners would like to improve their learning about learning, then apparently there’s a good video out there, and you can check that in the show notes. But as well, we’d sure like to hear from you with your questions and concerns, with your ideas about how to do root cause analysis. Hey, we never even talked about not just five-whys analysis, but making sure that you do the five so-whats; we’ve got to talk about that another time.

Jeffrey: True!

Squirrel: Because that’s a great idea from you, Jeff.

Jeffrey: That’s right.

Squirrel: But if listeners would like to talk to us about any of those things, then the best way to find us is to go on over to agileconversations.com, where you’ll find our email and Twitter and lots of other things that apparently we do. The other way, of course, to keep in touch with us is to come back again next Wednesday when we’ll have another edition of Troubleshooting Agile. Thanks, Jeffrey.

Jeffrey: Thanks Squirrel.