This is a transcript of episode 311 of the Troubleshooting Agile podcast with Jeffrey Fredrick and special guest Gene Kim.

Phoenix Project author, Gene Kim, is back on Troubleshooting Agile to discuss the groundbreaking theories of organizational management described in his new book, Wiring the Winning Organization. In this episode (part three of three), Gene discusses how and why you should be “simplifying” and “amplifying” in your DevOps team.

About Our Guest

Gene Kim is a Wall Street Journal bestselling author, researcher, and multiple award-winning CTO. He has been studying high-performing technology organizations since 1999 and was the founder and CTO of Tripwire for 13 years. He is the author of six books, including The Unicorn Project (2019), and co-author of the Shingo Publication Award winning Accelerate (2018), The DevOps Handbook (2016), and The Phoenix Project (2013). Since 2014, he has been the founder and organizer of DevOps Enterprise Summit, studying the technology transformations of large, complex organizations.

Show links:

Listen to the episode on SoundCloud or Apple Podcasts.

Simplification

Listen to this section at 00:14

Jeffrey: Welcome back to Troubleshooting Agile. This is Jeff Fredrick, and I’m joined for a third and fantastic time (you know, third time pays for all) by Gene Kim, author of Wiring the Winning Organization, here to share with us the last two elements of the playbook that they put together in this amazing framework. Gene, thank you again for coming back, so good to have you back.

Gene: Oh, are you kidding me? Jeffrey, it’s always amazing hanging out with you. I learn so much every time we have an interaction.

Jeffrey: Well, me too. It’s just, you know, one of those conversations that can go on and on. So happy to have you back. And the last time we talked about slowification, and in particular, I was happy to introduce what excites me about it, which is that this is a language that I hope our audience can use to bring to leadership, non-technical leadership, the value of slowing down, to take time and deliberately learn.

Jeffrey: There are two other principles, and we touched on them last time. We talked a little bit about simplification. So I hope to talk about simplification and amplification. Let’s get into simplification here. You talked about it a bit in terms of a metaphor you use in the book: you and Steve are in charge of renovating a Victorian hotel, and you found some classic problems of architecture, of coupling, and the deliberate or unexpected outcomes of the practice of expediting. You know, basically how you could really make a mess. And you talked about some of the principles to simplify the situation. You have three practices that you distilled in simplification, if I remember this correctly. Can you tell us about those?

Gene: Yeah. For the nerds like us and the nerdy audience. So the first is slowification. You move work from the production environment, and you shift it temporarily to planning and practice. Where it’s safer. Right? You can learn.

Gene: The second technique - mechanism - around simplification changes the nature of the problems so that they are easier to solve. And so there’s really three ways to split it. One is- and they all deal with taking a big problem and then dividing it up in different ways. So the first way that you can divide things up is through incrementalization. So you know, the notion of like instead of solving it all at once waterfall style, we do it in slices. Right. So that should be very familiar. Anyone in an Agile, incremental, uh, thing will kind of recognize why that’s good: small batches. Um, so that’s the first one.

Gene: And the other two, which we touched on last week, are so interesting to me. One is we modularize things. Right? So more teams can work in parallel. And so this is one way of partitioning.

Gene: And the second way is the exact same thing, but for sequential processes. When you have interdependent steps, there are ways to partition those so that we can gain independence of action. And so that’s like an assembly line, the Toyota Production System, you know they’re all enabling independence of action. And so the most famous example of modularization was the Amazon API re-architecture in the early 2000s.

Jeffrey: And the famous memo from Jeff Bezos, right?

Gene: You know. Yeah, exactly.

Jeffrey: Or even maybe the more famous Steve Yegge.

Gene: Steve Yegge!

Jeffrey: His description of it, ‘I want you all doing this by next week, you know, or you’re fired.’

Gene: Yeah!

Jeffrey: So, fantastic. And the thing was, you can, if I remember this correctly, any two teams can only talk to each other through their public API. Like that was it. That’s the law.

Gene: Right! And so, it was so rewarding for that. I mean, I had studied that case study over the last decade, many, many times. It showed up in The DevOps Handbook, but I actually learned a ton in the last year and a half. And so if I were to retell it, it’s that in the late 1990s, life was pretty easy at Amazon. You know, they have everything that an e-commerce site would have: shopping cart team, product page team, you know, returns, inventory, right. All those things. And then you have two product teams, they have books and music. Life is good. You can do hundreds of deployments per day, high success rate of doing deployments. But then as you start adding more and more product categories like toys-

Jeffrey: Becoming the everything store.

Gene: Everything store, right! By 2001, they have 35 different categories, including clothing. So books and music, you have like one SKU (Stock Keeping Unit), right? Clothing, you have 50 SKUs, right? You know, size, variety, gender, color. And apparently every one of those required a database schema change. And that’s what actually caused multi-day problems at Amazon.

Gene: So you have like now the e-commerce teams on one axis and you have 35 different product category teams on the Y axis, all of which have to communicate and coordinate. And it led to this ridiculous situation where the Amazon digital teams, Kindle, video, music, for them to fulfill and get an order through the pipeline, they had to provide a physical shipping address, which is kind of strange.

Jeffrey: Haha yeah.

Gene: So Werner Vogels, in this article I’ve cited many, many times, there was a phrase that he said: “those digital teams went to the ordering teams, 60 of them, and said, ‘could you please make it so they won’t have to provide a shipping address?’ And they said, ‘no, we haven’t budgeted for it.’ And now they were stuck.”

Jeffrey: Yeah.

Gene: And so this is one example of the massive amount of communication and coordination that was required to even get small things done. Another marker of this is that they were siloed between dev and ops, and they had this coordination cost. So the number of deployments went from hundreds a year down to tens. Most deployments did not finish.

Jeffrey: Can I just focus on something you just said, Gene, because I think it’s something easy for people to miss. As you were describing it, the database changes, that sounds very much like a technical problem. And I think a lot of people in our audience will tend to hear it through a technical filter.

Gene: Yeah.

Jeffrey: But then you added something else. You said they went around to talk to all the teams. And so now we have the social part. And this is kind of like the Conway’s Law kind of situation where there’s a connection between the architecture and the organization. And I think it’s natural for technologists to think about the technical side of the problem, and they often miss how that introduces costs on the social side. So this is the socio-technical system, and what you’re describing here doesn’t just cause improvements on the technical side. It also leads to improvements on the social side. And it’s-

Gene: Yeah!

Jeffrey: -the whole system that improves.

Gene: It is exactly the same as the movers and painters. It’s that the layer three social circuitry was not able to do what needed to get done because of the layer one and two problems. It was inadequate.

Jeffrey: That’s right.

Gene: So what did they do about it? They said every team should be a two pizza team. Everyone should be able to work independently, to be able to work on Amazon’s biggest problems. They can deliver value to the customers without the need to communicate and coordinate. We want more doers, fewer coordinators.

Jeffrey: Yes.

Gene: And that’s what led to the famous memo that Steve Yegge reconstructed so brilliantly. That said ‘every team must coordinate only through service interfaces, APIs’ and therefore liberate and create independence of action between teams. So that led from one monolithic code base to hundreds, to thousands. And that’s how they’re doing 136,000 deployments a day by 2014.

Modularity

Listen to this section at 08:22

Jeffrey: Yeah, amazing. Now, you mentioned that there’s the way of- can we think of a similar example for the linearization of the ones where there’s interdependent steps?

Gene: Yeah.

Jeffrey: That one is going to be, I think, even more surprising to people. I don’t think they have as much reference point for that.

Gene: Yeah. So there’s something really kind of- in fact, you and I have talked often and marveled at this very strange, amazing thing that is a hallmark of the Toyota Production System, which is: how is it that you can have such a high throughput system generating 5,000 cars per day, but they also can do 5,000 Andon cord pulls per day? Right?

Gene: It should not be possible. And yet the reason they’re able to do that is that that too has modular characteristics, because for every Andon cord pull that happens at the edge, you have 55 seconds to resolve the issue before it causes the next level of hierarchy to be mobilized, right? So a line segment might get stopped, and then that might turn into a set of line segments getting stopped. That then might cause another escalation that causes a larger segment, to potentially the entire assembly line, grinding to a halt.

Gene: But that does not happen in most cases because it has modular properties, just like you would see at Amazon. So one is for problems that have been partitioned by space, and the other ones are the ones partitioned by time. And so just to boil it down, like the movers and painters, if you can define the handoffs very precisely between the movers removing furniture, painters starting, painters finishing and painters signaling that the furniture can come back in, you now can have independence of action between the movers and painters.
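The tiered Andon escalation Gene describes can be sketched in a few lines of Python. This is only an illustration of the idea, not Toyota's actual procedure: the 55-second first window comes from the conversation, while the `Tier` class, the later tiers, and their time budgets are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    scope: str           # how much of the line is affected while a problem sits at this tier
    budget_seconds: int  # time allowed at this tier before the problem escalates


# Only the first 55-second window is taken from the conversation.
# The later tiers and their budgets are hypothetical, for illustration only.
TIERS = [
    Tier("nothing stopped: local help at the station", 55),
    Tier("one line segment stopped", 300),
    Tier("several line segments stopped", 900),
]
WHOLE_LINE = "entire assembly line stopped"


def stopped_scope(seconds_unresolved: int) -> str:
    """Return how much of the line is affected after a cord pull
    has gone unresolved for the given number of seconds."""
    elapsed = 0
    for tier in TIERS:
        elapsed += tier.budget_seconds
        if seconds_unresolved < elapsed:
            return tier.scope
    return WHOLE_LINE


if __name__ == "__main__":
    for seconds in (30, 60, 400, 5000):
        print(seconds, "->", stopped_scope(seconds))
```

The point of the sketch is the shape, not the numbers: most pulls are resolved inside the first window, so the blast radius of a typical problem stays local and the rest of the line keeps moving.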

Gene: Just like we can liberate and create independence of action for teams at Amazon, we can do so for people participating in sequential activities. Of which, to me, the most surprising (I wish we had made this more clear in the book) is continuous integration and test. Like, it liberates the build engineers from having to coordinate with the security engineers and the QA and the quality engineers, so they can all work independently, enabling developers to work independently.

Jeffrey: Yeah, and I think, for people who want to hear more about that view, Investments Unlimited, I think, does a great job of laying out an environment where you get all of that stuff integrated. And you have now the needs of security and audit and everything else being met in a way that’s actually liberating, and a better development experience for the developers, which is an unintuitive result, I think, for a lot of people who think about security and audit as things that are going to be slowing them down, adding costs, adding delays. But actually the proof is here, in the improvement of the system from having that all be integrated and cohesive.

Gene: Yep!

Jeffrey: So…

Gene: So maybe just to nerd out one last time on this, just for me, it’s kind of marvelous to say, yeah, this is very satisfactory because these are orthogonal. We’ve covered now the temporal dimension.

Jeffrey: Yeah.

Gene: And the spatial dimension. There’s not a lot of dimensions left, right? Hahaha

Jeffrey: Hahaha

Gene: You’ve got space and time, all living under simplification.

Jeffrey: Yeah. Now you use the example of the Andon cord pull, which, for people who don’t know, the Andon cord is the thing at the Toyota factories where you pull it and people think, ‘oh, it’ll stop the line.’ But as you explained, no, it’s a tiered system that actually, you know, you pull the cord and you initially get help from a local work group supervisor, and then, if it’s not resolved, it then escalates and it can eventually stop the line. I think this is a good connection to your third principle here in Wiring, which is amplification.

Gene: Yes!

Amplification

Listen to this section at 12:35

Jeffrey: Tell us about amplification and the role it plays in the social technical circuitry.

Gene: Right, so slowification is kind of based on the Kahneman-Tversky, Thinking Fast and Slow. Simplification is based on modularity, option theory, independence of action creates amazing benefits. And then amplification, actually, we hearken back to Claude Shannon and Information Theory and Nyquist, and saying that really what we want in any good system is we need to amplify weak signals of failure, so we can act decisively upon it, so that we can prevent bad things from happening and enable quicker detection and prevention and recovery.

Gene: This kind of wonderful… For me, it was just riveting to see the Southwest Airlines holiday crisis that they had at the end of 2022, where they had Winter Storm Elliott. All the airlines have to cancel flights, but something very strange happens, which is that every other airline was able to move quickly to-

Jeffrey: Recover.

Gene: Yeah, recover. And Southwest Airlines continued to get worse. And it turns out that the reason for this, as was widely written, was the crew scheduling system. Where at the end of each day, whenever a pilot wasn’t where they were supposed to be, they would have to call a phone number to say where the plane and the crews were, and they have to wait on hold for half an hour or scores of hours. So by the time the next morning came around, the planes weren’t where they were supposed to be.

Jeffrey: Yeah.

Gene: And this is such a great example of a control system that was wildly slower than what it was trying to control, the production environment. And I thought that was such an amazing metaphor for, like, how the layer three social circuitry control system is an information system. This was a physical metaphor for this. So for me, it was just amazing.

Jeffrey: And I think you actually- I was familiar with Claude Shannon’s Information Theory, but I had missed one of the things that you have in the book, that was one of my big Aha! Moments, which was the sampling rate required by the control system. Can you explain this?

Gene: Yeah, so Harry Nyquist and Claude Shannon, the doctors Harry Nyquist and Claude Shannon, they have this amazing proof that says the receiver has to sample at twice the rate of the fastest signal that the sender is going to send. And so kind of a proof point is: imagine a sender sending a sine wave. If you only sample once per period, you will only hear one note, if this is an audio signal. Right?

Jeffrey: Right.

Gene: And so there are some very strict requirements of how fast the receiver must sample in order to get a certain message across. And so Southwest Airlines was what happens when, you know, it gets wildly behind: the information cannot get generated, transmitted, received, and acted upon fast enough. Clearly this has some electromagnetic analogs, but another kind of aspect of this is, we can imagine systems where no one can actually tell bad news, where somehow the, quote, ‘culture of the organization’ ensures that any signal that isn’t good news is stifled, maybe even extinguished entirely. And so part of the job of the leader is to create a layer three social circuitry where even the weakest, faintest signals of danger can be amplified so they can be acted upon.
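Gene's sine-wave proof point is easy to reproduce. The following is a minimal sketch using only the Python standard library; the function names, the 1 Hz signal, and the phase offset are assumptions chosen for illustration. It samples the same sine wave once per period and then four times per period, and compares how much the sampled values vary.

```python
import math


def sample_sine(signal_hz: float, sample_hz: float, n: int, phase: float = 0.3) -> list:
    """Take n samples of sin(2*pi*f*t + phase) at a fixed sampling rate."""
    return [
        math.sin(2 * math.pi * signal_hz * (k / sample_hz) + phase)
        for k in range(n)
    ]


def spread(samples: list) -> float:
    """Difference between the largest and smallest sample seen."""
    return max(samples) - min(samples)


# Sampling once per period: every sample lands on the same point of the wave,
# so the receiver sees an (almost exactly) constant value and no oscillation.
once_per_period = sample_sine(signal_hz=1.0, sample_hz=1.0, n=8)

# Sampling four times per period (above twice the signal frequency):
# the receiver sees the wave swing between positive and negative values.
four_per_period = sample_sine(signal_hz=1.0, sample_hz=4.0, n=8)

if __name__ == "__main__":
    print("spread at 1 sample per period: ", spread(once_per_period))
    print("spread at 4 samples per period:", spread(four_per_period))
```

With one sample per period the spread is essentially zero, so the receiver cannot tell the oscillating signal from a flat line, which is the situation of a control system that checks in too rarely.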

Jeffrey: I want to bring these two things together, because on the one hand, you’re saying that you have these very weak signals you want to amplify. On the second, your control system needs to be sampling at twice the rate of the underlying activity. Okay? I’ll put those two things together.

Jeffrey: Now, I think a very common practice in a lot of organizations is, we maybe have a weekly meeting, maybe a monthly meeting to review the plan, right? And this is the place where the project manager is going to say, ‘well, where are we?’ And you have the stakeholders in the room, and you have people who might be senior enough to put their weight behind bad news or maybe not. And this is the question, do things get extinguished or not? That’s a cultural element.

Jeffrey: I think you talked about the space shuttle disaster where people were trying to bring out the possibilities of what the foam strike would have done. And we need to go and visualize the shuttle, but you have project management saying no because they’re worried about dynamics. So they’re squelching it. But there’s also this combination of both visibility and frequency. And so this is my thing: so commonly people will have a weekly planning meeting and it’s like, ‘well gosh, you can only really then have input on things that happen on the cadence of every fortnight, of every two weeks.’

Gene: That’s right!

Jeffrey: And for a lot of organizations, maybe that’s fine. But I know the organizations I’m working with are often talking about, ‘what are we going to get done this week?’ Well, if you want to know what’s going to be done this week, you can’t just have a weekly meeting, right? There’s a disconnect! If you’re thinking about a weekly cycle, you can’t just-! You need to meet at least twice. It’s a sort of-

Gene: Twice a week! In order to have a corrective action, you have to be able to receive the signal at twice the rate of what you’re trying to act upon.
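The arithmetic behind this exchange can be written down directly. This is a rough sketch of the rule of thumb as Jeffrey and Gene apply it to meeting cadence, not a rigorous result about organizations; the function names are made up for this example.

```python
def max_review_interval_days(decision_cycle_days: float) -> float:
    """Longest gap between reviews if you want to steer work that changes
    on the given cycle: review at least twice per cycle."""
    return decision_cycle_days / 2


def shortest_steerable_cycle_days(review_interval_days: float) -> float:
    """Shortest cycle you can meaningfully steer with reviews this far apart."""
    return 2 * review_interval_days


if __name__ == "__main__":
    # A weekly meeting (7 days apart) only steers things on a fortnightly cycle.
    print(shortest_steerable_cycle_days(7))
    # Steering weekly goals needs a review at least every 3.5 days: twice a week.
    print(max_review_interval_days(7))
    # A morning stand-up plus an end-of-day demo (about half a day apart)
    # supports steering on a daily cycle.
    print(shortest_steerable_cycle_days(0.5))
```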

Jeffrey: And it actually, some of the best projects I’ve worked on had a really interesting characteristic, some of the ones that were most fun for me. And you know what it was? We would meet in the morning, do the sort of morning stand up, and we would have a daily demo at the end of the day.

Gene: Oh…

Jeffrey: We had sampling twice a day, where we could be very close to the work, very close to alignment and be very fast reacting to what came up. It gave me sort of this mathematical support for my visceral experience. Like, you talk about parsimony, like, does the theory validate lived experience and the intuition you’ve developed over time. And that was one of the elements that stood out. So that was fantastic.

Gene: In fact, can I add one more dimension-

Jeffrey: Please!

Ratios, Not One Size Fits All

Listen to this section at 19:01

Gene: -to add more color to this. And so, like, what would cause you to increase the frequency of these kind of reporting signals? And it was actually Admiral John Richardson, former Chief of Naval Operations, who said it’s in the Andy Grove High Output Management book. He said, ‘if the leader of a certain task or project has never done it before, then you probably want to up the frequency. If the type of problem being solved is so novel that it’s never been solved by anyone: up the frequency. If it’s highly consequential, that has, you know, terrifically bad things that happen if mistakes are made, you’ve got to up the frequency.’ Versus routine activities that this person has actually done a ton of times before, there’s probably not as high of a need for such high frequency reporting. So I just think all of these things, as you said, kind of give us a, uh… It says why we make certain decisions in these lived experiences.

Jeffrey: Yes, that’s right. And it’s really great there because you make it very clear that this is not dogma. This is not one size fits all. For me, the whole set of practices here, the slowification, the simplification, the amplification are like a recipe book. You know, actually there’s a recipe book that’s like called the ratio book that actually just gives you ratios of ingredients. And it’s kind of like that. You’re kind of saying go and use these things to figure out and customize it to your thing, and given your circumstances, what’s the right recipe?

Gene: I just bought- oh my gosh! I’m buying this book right now! I love ratios. In fact, I was just going to go there next!

Gene: So Steve, when he learned about the Nyquist-Shannon Sampling Theorem, he had a similar reaction. He’s like, ‘oh my gosh!’ And the connection he made was, so when you have… Not only do you want to increase the frequency when you kind of are pushing these different dimensions, you also want to have a smaller ratio of people on the line to leader. In other words, the more dynamic the situation, the smaller team you want to have, because that too is a buffer, where the leader can actually jump in and use their experience to help. Whereas if it’s all routine, all exactly the same, then you can actually expand the ratio of associates to leader.

Jeffrey: Yeah. Yeah, that’s a really interesting part. I had not made that connection myself, but when you say it out loud, I would say like what you’re adding is what’s the ratio of metacognition to work.

Gene: Hahaha, right.

Jeffrey: Right. And so that fits very well. It’s really rewarding for me that, as I said, this is like validating so many other concepts. And the last thing I’ll say on amplification is, it reminds me of the early days of Agile. One of the things that I remember, I think I got from Brian Marick, was big, visible charts.

Jeffrey: How can we have shared facts with everyone about where we are? How can we amplify that information so that we have these facts and we can say, ‘oh, this is a problem. This is something that we all know about.’ And actually I learned that, honestly, as a child. My father worked in manufacturing engineering, and they had a problem at their company where they were not shipping enough units. And he made a control chart that showed a burn up line and said, ‘well, if we want to ship 30 units this month, you know, where are we on the line?’

Jeffrey: And they had terrible, terrible trouble. But somehow making this control chart that gave everyone shared visibility allowed them to do a much better job of coordinating around the problem, which is why you amplify, right? You amplify it to bring awareness, bring joint problem solving. I had the good fortune to learn that literally as a child. And to see this now explained in the framework was very, very satisfying.

Gene: Can I add one other sort of like Aha! Moment-

Jeffrey: Yeah!

Shared Consciousness

Listen to this section at 22:59

Gene: -that shows up in the amplification chapter? So yeah, that notion of shared consciousness shows up in the book Team of Teams. And that too requires some amplification. And sometimes when you have like, say, movers and painters, suppose you have very distrustful relationships between the movers and painters, and you need them to coordinate together. One of the most powerful things you can do is to embed a mover in a painting team and vice versa.

Gene: And one of the best examples of this that I learned about in the process of the book was the Apollo capcoms, the capsule communicators in the US space program. It turns out like whenever there’s an astronaut crew in space, the only ones at Mission Control who can talk with them are these capsule communicators. And so they are astronauts. They are not just any astronauts. They’re the people who trained the astronauts or are the backup crew. And the reason they do that is that when you have a very thin channel that’s very sporadic and communication matters a lot, the best thing you can do to maximize the bandwidth between them is you put an astronaut on both sides of the channel.

Jeffrey: Yeah! Hahaha! Shared experience to have those shared points of reference that will make things so much more efficient.

Gene: Exactly.

Jeffrey: Making the most of the situation, that scarce resource of bandwidth. Yeah.

Gene: Exactly. And just as Alan Kay once said, ‘for very important messages, you do not send a message, you send a messenger.’

Jeffrey: Ah!

Gene: So as the designer of the socio parts of the sociotechnical system, there are some interfaces that you want highly coupled. And if you can’t put them side by side, you put essentially an avatar, proxies, on both sides to maximize every bit of communication between them.

Jeffrey: Fantastic. Gene, I say this has been hugely fun for me. I really hope that our listeners, if they’ve enjoyed it half as much as I have, then this will probably be one of their favorite episodes. Thank you so much for your generosity coming here, being with us for three weeks. Where can people find out more about you, about Wiring the Winning Organization? Learn more about these ideas. What should they do?

Gene: Oh, yes. I would say go to ITRevolution.com, that is the publisher of all these books and the Enterprise Technology Leadership Summit. Wiring the Winning Organization is available at your favorite book retailers. And I have to say, Jeffrey, thank you so much for this time and also for all the feedback and encouragement that you gave throughout the writing process. I cannot tell you how much I’ve appreciated it and how much your fingerprints are all over the book.

Jeffrey: It’s been my genuine pleasure, Gene. It’s so much fun to talk to someone else who takes ideas seriously and delights in sharing them and discovering the Aha! Moment. So it’s been fantastic. So thank you so much.

Jeffrey: And thank you to all our listeners. This has been another episode of Troubleshooting Agile. If you’d like to hear more from us then go to AgileConversations.com. You can find us on Twitter and email. All the information for Gene will be in the show notes. So if you’re driving, we’ll have everything there. Don’t try to write it down. ITRevolution.com: we’ll put the links in there, and links to favorite retailers. And you can pick that up. You can of course hear more from us next week, when we’ll be back on your favorite podcast player for another edition of Troubleshooting Agile. Thank you so much, Gene.

Gene: Thank you.