This is a transcript of episode 362 of the Troubleshooting Agile podcast with Jeffrey Fredrick and Douglas Squirrel.

Do you document and analyse failure?

Listen to the episode on SoundCloud or Apple Podcasts.

Donald Knuth & Documenting Failure

Listen to this section at 00:14

Squirrel: Welcome back to Troubleshooting Agile. Hi there, Jeffrey.

Jeffrey: Hi, Squirrel. So, Squirrel, you told me you had something interesting to discuss: documenting failure. I love this idea. So tell me, what’s the idea? And where did you get it?

Squirrel: I was, as you do, just perusing the incredible writing of a very talented, probably the most talented computer scientist in the world, Donald Knuth. He turns out to have written something that’s really relevant to our listeners. A lot of what he does is so technical that many of our listeners would not be interested.

Squirrel: But what he also did, just because he was bored, was write an entirely new type of software, the first and only version of this software, and it’s now used by every mathematician in the world. So that’s how good this guy is. And that software is called TeX (T-E-X), but it’s pronounced not like the state of Texas, but with a “kh” at the end.

Squirrel: This piece of software, he then analysed in great detail for all the failures, and all the things he didn’t do right, because he’s the only programmer. Nobody else writes software for this thing. So it’s a very unusual example of an extreme test case. We normally have teams, and they have dependencies on other people, and they have requirements and product and all that! And where did something go wrong? You can’t figure out who it really was.

Squirrel: In this case there’s only one guy, it’s Don! It’s just him and nobody else. And so it’s a wonderful read, and we’ll link to it in the show notes. You don’t have to follow all the technical details to really understand the sorts of things that go wrong and why, and he’s very humble about it and says, ‘I was an idiot here, and I really should have noticed this, but I didn’t. And here are 47 other places where I made the same mistake.’

Squirrel: But I thought the most interesting bit was actually on page two. And that’s where I would really like comments and help from our listeners. He refers to a book that I haven’t been able to find. It’s a best-selling book, and it was written by the grand-uncle of somebody named Paul Vitányi, who’s another computer science expert. The book is for civil engineers, so you would think it wouldn’t have anything to do with software, but in fact, it is devoted entirely to descriptions of foundation work that proved to be defective...

Jeffrey: Oh!

Squirrel: Which I think is a nice way of saying buildings that fell down.

Jeffrey: Right!

Squirrel: There’s a fantastic quote that I’m going to read out in full just because it’s so good... oh, but he has a footnote! Maybe we can find this book. We’ll put it in the show notes if we can, because I’m desperate to read this. It would be good Christmas reading for any of us.

Squirrel: Here’s the quote: ‘It is natural that engineers should not wish to draw attention to their mistakes, but failures are sometimes due to causes of which there has been no previous experience, or of which no information is available. An engineer cannot be blamed for not foreseeing the unknown, and in such cases his reputation would not be harmed if full details of the design and of the phenomena that caused the failure were published for the guidance of others. To be forewarned is to be forearmed.’

Squirrel: I just thought this was a great philosophy and a superb example of humility in a technical person. There’s tremendous learning to be gained, both in the detail and in the attitude, and perhaps it would be really good for us to look into why our software systems fall down. What do you think, Jeffrey?

Jeffrey: Well, absolutely! What’s interesting about what you’re describing here is that, on the one hand, he’s saying, ‘hey, we should learn about and study why our systems fail.’ And we have the process of doing a postmortem or a post-incident analysis, whatever we might call it. But part of what he’s pointing out here that is, I think, very different is actually publicizing it to the world, whereas in my experience a lot of people treat those postmortems or incident analyses more narrowly. Like, ‘how do we prevent this mistake from happening again?’ In kind of a small sense, we want to fix the system and then make sure it doesn’t happen again.

Jeffrey: But the idea of having a larger body of knowledge, like ‘these are the collected mistakes’, that’s something that certainly speaks to me. And I think it is not common, nor, having read the paper by Knuth, is the level of analysis that he goes into on his mistakes. He doesn’t just document every mistake, he also goes and creates a taxonomy of the types of mistakes. In other words, he’s generating higher-level knowledge from it to say, ‘well, what are the classes of errors I’m making?’ And I think from that higher-level view, you can have a richer set of learning.

Squirrel: And of course, we’re not suggesting to listeners, although we’d love it if you did it, that you need to create taxonomies and analyse the last ten years’ worth of bugs and so on. Knuth is kind of nuts in this regard, going to much greater lengths than you might expect. But what you might do is ask your team, ‘when’s the last time you looked at your mistakes?’ Now, what they’ll probably tell you is that they followed the Scrum book, and they did a retrospective. But a lot of retrospectives I’ve sat in have not had the sort of character of the paper we’re referring to. They are kind of ‘well, you know, we didn’t quite do this right. And okay, we’ll ask for more requirements next time, but we basically did okay.’

Squirrel: It would be much more effective if there were a few foci, a few areas of real clear direction in which somebody said, ‘you know, I screwed up here, and we could make sure that happens less often, and I could make fewer mistakes if…’ And then they finished the sentence. What do you think, Jeffrey?

Jeffrey: I agree completely and wholeheartedly with this. I’m a huge fan of doing actual, in-depth postmortems and retrospectives. And in fact, the name of my consulting company is ‘Reflect Adapt’, with the idea that learning from what we’ve done is a hugely valuable activity, and one that deserves, generally speaking, more attention and energy than I think it tends to get.

Squirrel: Fantastic. All right. Well, if listeners have done any reflection and adaptation, we’d certainly like to hear about it. And of course, the other way to keep in touch is to come back again next week when we’ll have another episode of Troubleshooting Agile. Thanks, Jeffrey.

Jeffrey: Thanks, Squirrel.