The growth industry of the 21st century isn't artificial intelligence or designer genes—it's catastrophes, specifically the kind caused by our ever-more interconnected technological systems. Autopilots that cheerily fly us into mountains, stock trading systems that go on crazed buying sprees, water quality or car emissions warnings that mask the problems they were supposed to detect; the examples multiply daily in newspaper headlines and court cases. Is Chicken Little right? Is the sky really falling?
I don't think so. My clients teach me over and over that the most successful organisations are those who manage the calamities, keep them small, and convert them into learning opportunities. One tech team I led years ago blithely released buggy code multiple times a day, ensuring that we could roll back any change with just one click and thus creating many tiny, easy-to-undo crises rather than one big disaster. Netflix go even further and let loose the "chaos monkey" to knock out servers and even whole data centres, preferring a small, controlled outage to hours of system downtime and improving robustness at great speed as a result.
Join me on this free livestream to discuss what you can learn from catastrophes with the help of a world-class expert on this topic: Chris Clearfield, co-author of Meltdown. Chris and I will talk about:
Why complex systems are incomprehensible by design, and why more training isn't the answer.
The dangers of interconnecting many components and why microservices aren't the panacea they seem to be (at least not without great care in their design).
How diversity of viewpoints and soliciting opposing views can help you anticipate, plan for, and learn from your own disasters.
Join the Squirrel Squadron now to get access to our live weekly events, weekly email, and exec forum.
Here’s the transcript:
Douglas Squirrel (00:01):
And it has to think for a while. Connect to everybody. Okay, I think maybe we're now connected. Yes, it says that we are good, and it usually takes a few moments for people to appear. So I will start by saying hello to people who might be watching this on a recording, because this is available of course on all the platforms where it goes out, and we love to see you there. If you're interested in engaging with these issues and asking us questions, we'll have a lot to say about how to do that. So although you won't be able to ask questions on the recording, you can certainly participate and we'd love to hear more from you. I see people appearing, which is fantastic. Welcome to the Squirrel Squadron. Welcome to my weekly event. And this week it's a very exciting one with my guest, Chris Clearfield. I'm going to say a bit more about that in just a moment. I'll just say, if you don't know where you are or why you're here, it's because this is the weekly event of the Squirrel Squadron. That's my community of tech and non-tech people learning together. We've got quite a number of folks now participating in these events, coming to live events and discussing all these topics on the forum we have for executives. We'd love to have you there; squirrelsquadron.com is where you can find out more about that. We have some very good events coming up soon. For example, next week we're doing a Zoom call on how to get your engineers to actually talk to customers. And somebody was on the forum today saying, maybe you don't want your engineers talking to customers, some of them are kind of grizzly. And I said, well, there are some ways that you can really make that successful for you. So we'll talk about that next week. And I'm live in London on the 8th of March, talking about Elephant Carpaccio, which is a way to slice your work into such thin pieces that you can see through them. That's what carpaccio means. But even more, you can see them because they go live. So if you want new features in your software every day, come along in London or sign up for the recording; you can do all that at squirrelsquadron.com. Let's see, so we are going to be talking about disasters today, because that's something Chris is a super expert on. And we can start with a disaster. I apologize, I think some of you may have got multiple copies of today's reminder email 15 minutes before the start; we weren't expecting you to come four times. It was a mistake in our CRM system. It's the kind of thing that happens sometimes, and we can talk about that as an example disaster if we want to. But let me introduce Chris first. Chris is the co-author of a book called Meltdown. That's how I know him, because I read the book and it was a sequel to one of my favorite books, Normal Accidents, and it really brings it up to date and talks a lot about how you can learn from disasters. But Chris, there's lots more you do, and I'll let you say a bit more about your background, what you're doing today, and why you love disasters.
Chris Clearfield (02:46):
Yes. That made me curious. Do I actually love disasters?
Douglas Squirrel (02:50):
Well, I don't know. You could say you don't. I do. That's what I would say.
Chris Clearfield (02:53):
I know I love them for the learning, right? Because they show us the kind of edges of our understanding of so many things. So as you said, I co-wrote this book called Meltdown, which I happen to have here on my desk. It's about essentially the way that the world is getting more complex and how we need not just a different set of tools, but a really different stance, to manage that complexity. And the book came out in 2018, which was a little while ago. We've had some major world events since then, including a pandemic, which to me keep emphasizing the interconnectedness that we all live within, whether we know it or not, and whether we want to or not. And since the book came out, what I've really seen is leaders who reach out to me because they have started to recognize that they have a problem that they can't solve within their system, that they can't solve with their traditional approaches, tools, and methods. And so that's what I do these days. I work with leaders and leadership teams who are solving problems that require real change to be successful, that require their organizations to change, themselves to change, their teams to change. And in that work, one of the things I really lean on as a kind of personal stance and a personal value is, well, there's a bunch, but the value of openness, the value of sharing, the value of vulnerability and trust. But I think at the root of a lot of that is curiosity, the value of curiosity. So I think that to solve the kinds of problems that we need to solve today, for many of them, curiosity is really the key.
Douglas Squirrel (04:49):
Well, fantastic. And those of you who know me well, and Chris does, know that curiosity is one of my favorite characteristics and I'm always being curious. That's why I do these events, because Chris and I are going to have a wonderful conversation and you guys are going to listen in and ask questions, which is going to be fantastic. So yeah, we're going to have a blast talking about these topics. But what I really want is for you guys to ask questions and be involved. So find the chat wherever it is. Unless you're on the recording, if you're live with us, please find the chat. And in there, what I'd like you to do to start us off, and to give us some good examples, is to tell me your favorite disaster. What's the one you're most interested in, that you think you could learn the most from, that you find the most entertaining or painful or something else? Tell us a little bit about that and we can apply some of the techniques, some of the tools that Chris uses, to those disasters. And I'm going to have Chris and I go first. So Chris, you can start thinking what's your favorite or most interesting disaster. But I also want to tell people, please come in with questions and comments and disagreements. You can see Chris and I might disagree about whether we love disasters; we're going to disagree about a lot of other stuff, and we'd like it if you have comments and questions. That's what makes this most interesting and exciting. Someone's already given us one, which is my marriage. We can delve into that one, he couldn't resist, or she couldn't resist. Fantastic. So tell us a little bit more, what's the disaster that's most painful for you?
Douglas Squirrel (06:15):
Let me start. The one that actually I find most interesting and fun, we'll start with a fun example, is one I don't know if Chris knows about. It's called the Friendly Floatees spill. This is where a number of different rubber duckies and other fun toys that you might have in the bath were in a container on a ship, and they fell off the ship, and that was the disaster. But the result was that the container opened up in the water, and all these turtles and birds and, I don't know, these little things that you play with in the bath, they all went into the ocean. And there's an oceanographer who has spent the last 25 years or so mapping the currents, because if you find a particular one of these in a place, he knows where it started and he knows when you found it, and he can map the current and figure out how it got there. So it's actually added a lot to our knowledge about ocean currents, this particular disaster of a container falling off a ship. So that's my favorite. Chris, what's your favorite?
Chris Clearfield (07:16):
Well, I didn't know about that one. And what a delightful and lighthearted disaster. This is part of my conflict with loving disasters, right? Because so often there are real, I mean, tragic human consequences to them. Let's see, if I think about a disaster that we write about in the book, one where there was some question of whether it would have to get pulled from the UK edition, but we ultimately got backing and we were pretty well sourced in how we covered it: the Royal Mail, I think that's the right organization, the Post Office, the UK Post Office, had a kind of combination of accounting and systems errors and a very, I'll say, draconian approach to them. So they essentially had databases that were dropping transactions, databases that weren't properly reconciling transactions. They had connectivity problems. And this would affect, you know, people in small villages, right? Who ran a corner store with a post office kind of inside of it. And these were folks that were really hubs of their community. And a not insignificant number of them were prosecuted and fined. Someone went to prison, essentially because, you know, if the post office service lost connection in the middle of a transaction, there were all sorts of complexities that led to this. And at the end of the day, the Post Office was basically saying, well, you're stealing from us, and had no ability to introspect and look at their own system. So, I mean, to me that's the one that's coming to my mind right now.
Douglas Squirrel (09:31):
Well, I like that one. Not, of course, because of the bad outcome for people, but because it's a beautiful example of both the complexity and the tight coupling that tend to lead to disasters. And you and I know those terms very well, because you literally wrote the book on them. But can you tell our viewers a little bit, maybe with that example or another one that you'd like to use, what are those two elements that tend to combine together to make sort of perfect disasters?
Chris Clearfield (10:02):
Yeah. Well, you know, you kind of talked about Meltdown as a sequel to a book called Normal Accidents, which it sort of is, a kind of philosophical carrying on of that. And Normal Accidents was written by a guy called Charles Perrow, who was a sociologist who, through a series of coincidences, was brought in as a party to the Three Mile Island investigation. So this is the big nuclear meltdown.
Douglas Squirrel (10:31):
I was living in Pittsburgh at the time, so I remember not drinking milk for a while. Keep going.
Chris Clearfield (10:35):
Yeah, yeah. So Perrow looked at this accident, and the people looking at it, the official commission, basically concluded that this was operator error. That the operators didn't do the right thing at the right time, and that was what caused the meltdown.
Douglas Squirrel (10:53):
They pushed the wrong button, that's why it melted down. It's all there.
Chris Clearfield (10:56):
Exactly. And that's really a classic trope, and I think something that we can use curiosity to get around, but I'm sure we'll get to that in a little bit. What Perrow did was he looked at this accident and he said, look, the logic of the accident wasn't even understood until months and months later. Months and months later, after all this investigation happened, you had somewhat of a clear understanding of the accident. But Perrow looked at this and he said, it's a real cheap shot to blame the operators. He said the real thing at fault here is the system, and it's at fault because it's complex. Which is a term that we use in the book to mean many different things, and he uses to mean many different things. It's opaque, there's a lot of interconnections, it's hard to see what's going on in the system. In a nuclear power plant, and also in technology systems, you can't just send somebody in to check the ground truth. You have to rely on all these indirect indicators, whether that's pressure and temperature or log files or even, you know, reports on marketing and email opens. So when a system is complex, you have this greater chance of these sort of unexpected interactions. And within those unexpected interactions, you can get things that are challenging, that emerge, and that you kind of don't understand. So that's complexity. And then you add to that tight coupling, which is this term borrowed from engineering, which basically just means there's not a lot of slack in the system, there's not a lot of buffer in the system, there's not a lot of space for someone to intervene, understand what's going on, and make changes. And what I think is interesting about tight coupling is it's kind of about the time scale of the system. So we can talk all the way from a trading system, a high-frequency trading system, that was kind of my first career, where you might have milliseconds before orders are going out, and that's a time scale that obviously humans can't think on. But then you can also have tight coupling on the scale of months, because of institutional inertia and because of how long change takes to happen. And so we have these kind of slow-moving disasters. I think about the US political system as one of them, where...
Douglas Squirrel (13:26):
We do it in the UK too.
Chris Clearfield (13:27):
Yeah, okay. Where even though you can sort of think faster than the system, you can't actually create change faster than the system. So when you have a complex system that's also tightly coupled, it's more likely to lead to these kind of, I mean, to meltdowns, right? To these sort of big unexpected disasters that are, the way I like to think about it, kind of orders of magnitude worse than the typical thing that happens in the organization, in a system, on a day-to-day basis.
Douglas Squirrel (14:01):
There you go. And the key thing, the key insight from both your book and Perrow's, is that it's almost certainly not operator error; the system was designed in such a way that these things are almost unavoidable, right? It's a system failure rather than a human one. The human happens to be part of the system and happens to be the one who is the proximate cause and pushed the button that caused the meltdown, right? But what about the missing error message that should have appeared above it? And I remember there's one in Perrow's book where somebody hung a hat over the switch and that covered up the warning light. There are just all these different pieces that go in to cause the problem. And, you know, the warning light might have been covered up by a hat, but there was no warning signal, there was no automatic switch off, there was no...
Chris Clearfield (14:47):
And there was no, there was no coat rack, right?
Douglas Squirrel (14:50):
Exactly. So that's where the learning can come, and I hope we can talk a lot about that: how, when you have a disaster, you can learn a lot from it. I did a livestream a few months ago where I talked about Netflix's Chaos Monkey. I don't know, do you know about the Chaos Monkey? So just in case viewers don't know, Netflix actually goes and creates many disasters on purpose. They go and take out a server or a data center or a whole region on purpose, with their live system, with their actual Netflix service that's serving movies to customers. And they do that on purpose because they'd rather have it in a controlled environment where they can learn, where they can say, hey, wait a minute, we need a coat rack here. They'd rather get that because they value the disasters so highly, they want to create them, but in a controlled way so that they can learn. So I just think that's really valuable. I want to pick up on something. My old friend Steven Halladay, hi Steven, has put one in the chat, and I think I sent this one to you, Chris; periodically, when I come across great disasters, yes, they're really interesting, I forward them to Chris. So he's put in one called the crash of Air Astana Flight 1388. And this is the one, Chris, you remember, where the control surfaces were the wrong way round.
Chris Clearfield (16:06):
The aileron cables?
Douglas Squirrel (16:07):
Yeah, exactly. There are so many of these good ones from air disasters because they're so well documented and so well investigated. They tell us a lot, and there's a lot of public records about them. So the brief version of this one, and I encourage folks to read it, Steven's put the link in the chat: the proximate error was that somebody put the cables on backwards. And so when you wanted to make the plane go down, you were making the plane go up, and when you wanted to make the plane go up, you were making it go down, and this messed up turns and everything else. And the poor pilots were trying to reverse engineer what had gone wrong while they were trying to fly the plane and not fly it into the ocean. And they did, they managed to land the thing, which is incredible. So this is sort of a disaster, but it has all the elements we were just talking about. There was tight coupling: there was a very short timeframe in which the airplane was being repaired, so there wasn't much time for anybody to notice there was a problem. It was extremely complex: the cables, if I remember right, were installed in a way where one cable was controlling two things, and you didn't understand which one was which. And so they did it the wrong way around in a very understandable way. It wasn't that they just weren't paying attention and were reading a book while they were doing it; they were paying very close attention, doing exactly what the instructions said, and the instructions were wrong in an unclear and important way. And all those things added up together to leave these poor pilots, trying to fly this cargo plane, with a completely uncontrollable aircraft. So Steven, thank you very much for that excellent example. It's another good illustration of tight coupling and complexity.
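For readers who want to see the shape of the Chaos Monkey idea Squirrel mentioned a moment ago, here is a minimal, hypothetical sketch of controlled fault injection. The instance names and the terminate() call are made-up stand-ins for a real inventory and a real orchestrator or cloud API; the point is only that failures are injected deliberately, during working hours, so people can watch, respond, and learn.

```python
import random
import datetime

INSTANCES = ["web-1", "web-2", "recs-1", "recs-2", "billing-1"]  # hypothetical inventory

def terminate(instance: str) -> None:
    # Stand-in for a real orchestrator or cloud API call.
    print(f"Chaos: terminating {instance} on purpose while the team is watching.")

def maybe_break_something(probability: float = 0.2) -> None:
    now = datetime.datetime.now()
    # Only inject failures on weekdays during working hours, so the resulting
    # outage is small, observed, and easy to learn from.
    if now.weekday() < 5 and 9 <= now.hour < 16 and random.random() < probability:
        terminate(random.choice(INSTANCES))

if __name__ == "__main__":
    maybe_break_something()
```

The production tools add safeguards such as opt-outs and scheduling, but the core idea really is this small: a controlled, reversible disaster instead of an uncontrolled one.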
Douglas Squirrel (17:52):
Chris, you still there? I'm just checking whether Chris is frozen. Oh, Chris is there, he's just thinking carefully, he's analyzing. So I want to come to one that matters to a lot of folks who listen to me, which is a technical area. And Chris, I don't know if you have seen this, but there's a trend in software, and I've literally been talking to a couple of clients this week about it, to build something called microservices. Have you seen this trend? Okay. And the wonderful thing about microservices is all this wonderful flexibility. It gives you all these options, all these possibilities to have many small pieces of your software which all do different things, are carefully interconnected, and are doing a lot of complex choreography to work together. And guess what? That's a complex system. And guess what? If it's trying to respond to users and do things in real time, it's tightly coupled, because if this service starts sending a lot of rubbish messages to this other service, it's going to start doing stuff and then it's going to start sending rubbish of its own, and that might be how we got four emails to everybody who was coming to this call. I don't know if that's what happened, but it's the sort of thing that happens when you're running microservices badly. And a lot of my clients have got really burnt by that approach when it's not done well. So I think that's one place where us software people can learn from many of these disasters. We should be very careful in making our systems extra complex in order to try to make them scale, try to make them more flexible.
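To make the duplicate-email failure mode concrete: one common defence in a message-driven system is to make each consumer idempotent, so that a retried or re-delivered message has no extra effect. The sketch below is purely illustrative and is not a description of the CRM involved; the message shape, send_email(), and the in-memory set standing in for a shared store are all assumptions.

```python
# In production the set would be a shared store (e.g. a database table or Redis),
# so every instance of the service sees the same processed keys.
processed_keys: set = set()

def send_email(address: str, body: str) -> None:
    print(f"Emailing {address}: {body}")

def handle_reminder(message: dict) -> None:
    key = message["idempotency_key"]  # e.g. "<event id>:<recipient>"
    if key in processed_keys:
        return  # duplicate delivery or retry: do nothing
    send_email(message["to"], message["body"])
    processed_keys.add(key)

msg = {
    "idempotency_key": "livestream-42:sam@example.com",
    "to": "sam@example.com",
    "body": "The livestream starts in 15 minutes",
}
handle_reminder(msg)
handle_reminder(msg)  # a redelivered message sends no second email
```

Loosening the coupling this way doesn't remove the complexity, but it gives the system slack: a confused upstream service can retry all it likes without multiplying the damage.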
Chris Clearfield (19:17):
Well, and let's talk about this, because I think this is where the stance and the paradigm really have to change, right? So I'm not a software engineer, but I like coding, right? I like playing with it. In my first job, I wrote code on Wall Street and I also looked at the risks of how our systems were connected. So this is an area I'm friendly with, but not an expert in. But what it makes me think about is: look, if I write a piece of code and it's in a monolithic application, I can test it, I can even maybe debug it, I can sort of step through and see what happens. And once you take that whole and decompose it into a bunch of different things, you really lose that ability to see the whole, and you lose that ability to understand all the pathways through it, right? That's the complexity. And so what do we need to do when we're managing a complex system? Well, we really need to change our paradigm. We have to update our mental models from, here's how I debug, or here's what I'm expecting my engineers to do. We have to update that to essentially what you were just talking about with the Netflix resilience engineering, which is: we've got to build our system and then be willing to break it, and be willing to not just test it in isolation, but test it in situ, test it as it is functioning.
Douglas Squirrel (20:49):
Because we're expecting emergent behavior, exactly. We're expecting things that we didn't program into the system. And we say, gosh, we never thought that if this was upside down, it would cause this, and then we would be in this situation. And you can't find that, exactly because of the complexity, and you can't deal with it in the moment, when you're not in a controlled environment, because of the tight coupling.
Chris Clearfield (21:08):
Exactly. And, oh, that's what I was going to say: what makes it hard is that it's hard to shift our paradigm, right? It is hard to say, what I have done for 20 years is wrong, or what I have understood before is wrong, and there is a new regime I am operating in. And it's hard to do that from an individual perspective. It's hard to do it from a team perspective. It's hard to do it from an organizational perspective. It's hard to do it from a policy and a tooling perspective. And the people that are closest to this problem are the people that are the most technical, who often have the least structural power, right? So you've got this kind of mismatch between what the people who are actually doing the work are doing and know about the work, and what the people who are leading think about the work. One way to talk about this is, I don't know if you've heard this phrase, work as imagined versus work as done.
Douglas Squirrel (22:13):
No. Say more about that.
Chris Clearfield (22:15):
Well, it's a term from safety, and, you know, safety as a discipline is going through a similar paradigm shift as software in many ways. So the old model, and actually the Air Astana example is a good example of this, is: it's written in the manual, operators do it this way, and here's the output you get, right? The old paradigm of safety is very top-down and procedure oriented. A supervisor writes a procedure, operators are expected to follow it. It's almost the extension of Henry Ford or Frederick Winslow Taylor. It's this kind of linear, top-down thing: leaders determine the actions, and other people, operators, implement the actions. And the point here is that work as done is different from work as imagined, and there are others, work as planned, work as prescribed, a whole series of abstractions of work. But I think the easiest is work as imagined: you as a leader imagine that this is how people do their work, versus here's how they actually do their work.
Douglas Squirrel (23:23):
And the map is not the territory.
Chris Clearfield (23:25):
The map is not the territory, exactly. And in how they do their work, the thing to look at is that people are actually very clever about how they do their work, right? They have a bunch of constraints that they're operating in, and they do their best to satisfy the requirement that their work gets done amidst all of those constraints. And this is the flip side of operator error, right? This is operator excellence, in a sense. They are getting things done despite all the things that stand in the way of getting their job done. And what I think is interesting about this is, when you start to get curious about how this work is actually done, that's one of the ways you can really learn about your system. It's one of the ways you can build resilience in. Recognizing workarounds, for example, is a signal that maybe your system isn't as well designed as you think. That's a really powerful source of insight. And then I think the other thing, and this goes back to learning and not blaming people, quite frankly, is that if you acknowledge that what people are doing is trying to do very complex work in a very complex system, and they're showing up with positive intent and doing the best that they can, then when something goes wrong, you don't blame them, but you kind of turn the lens around and look at yourself as a leader in the system. And that's hard to do.
Douglas Squirrel (24:45):
It certainly is. I want to pull in something from Steven. He's noting, on the Air Astana example, that one of the problems was the tests weren't right. So there was a mechanism for checking, and that mechanism wasn't working. They went and ran tests, and the tests said green, this is good, this is what you should do. And green meant bad; it meant this isn't okay, in that particular situation with the cables reinstalled and everything else. But the people who were performing the actions were being excellent operators. They were trying to repair the aircraft in the right way and following all the steps that the leaders had given them. The problem was they were in a situation with emergent behavior, where the aircraft was not in the state it was supposed to be in, and therefore the tests were wrong, the feedback was wrong. You know, when the pilots went and did their checks before flying the aircraft, they pushed on the control stick and something moved, and they said, great, something moved, that's right. They didn't realize it was going the wrong way. So these are the sorts of things, thank you very much, Steven, that we can learn from disasters. If we train our operators to be creative, to be innovative, to be thoughtful, and we don't blame them, we will get much better results from them. We may not get saluting and following orders and doing everything by the book, but we will get better responsiveness, better resilience in the face of emergent behavior. So Chris, I wanted to ask you about another one that's in your book, which has some different characteristics, and I think it leads in this leadership direction, because it's one in which there was some malfeasance, there was some bad action. And I wonder what you think about it. This is where Volkswagen faked its tests. If people remember, they set up their cars so that in only one situation would they have lower emissions, and that was the situation where they were being tested for emissions by the regulator. So somebody did something wrong there. And I wonder what your thoughts are: how does that match with our coupling and complexity model?
Chris Clearfield (26:52):
Well, this was one of the emergent things that actually came from, I mean, you know this because you've written a book, but writing a book really, at its best, is an emergent process, right? You have an inkling, you go explore that. You find stuff that confirms your inkling, you find stuff that disconfirms your inkling, and you build a kind of mental model of what you're writing about over time. And this is a whole part of the book that really emerged from the process of writing it, which is: complex systems are also more susceptible to wrongdoing. They're more susceptible to fraud, to cheating, for kind of the same reasons that we talked about in terms of these fundamental aspects of complexity. It's hard to see what's going on in the system. There's a layer of abstraction there. There's a lot of moving pieces. There's a lot of people involved, right?
Douglas Squirrel (27:49):
It's easy to hide.
Chris Clearfield (27:50):
It's easy to hide, exactly. And in fact, sometimes we build metrics that are themselves part of the problem. Anybody who's worked with something like OKRs or KPIs, doing goal setting with a team, will find that, and this is back to the map is not the territory, right? You can really over-index on these key performance indicators, these incentives that don't actually get you the behavior you want but get you, again, this is work as imagined versus work as done. I'll go into the Volkswagen example, but here's the one I think every leader should be taking to heart: I really cringe every time I'm working with somebody and they talk about wanting to use incentives to change behavior in the organization, because incentives are a great tool for optimizing known behavior, optimizing fixed and known behavior, but they are very demotivating. We know that about incentives. They encourage people to be less creative and take less risk. And they also have huge unintended consequences, or as John Sterman from MIT calls them, they're not unintended consequences, they're just consequences you didn't think about, right? So Wells Fargo is, I think, the best example of this at its worst, where, you know, they wanted to...
Douglas Squirrel (29:09):
So tell us the story. What happened at Wells Fargo? Not everyone will know.
Chris Clearfield (29:11):
So Wells Fargo wanted to promote cross-selling.
Douglas Squirrel (29:14):
They're a bank in the US
Chris Clearfield (29:16):
Keep going. They're a big US bank. They have a, you know, relationship with a customer who has a checking account. Wells Fargo wants that customer to have a mortgage with them, and a credit card, and another checking account, and, you know, a savings account, all of these things.
Douglas Squirrel (29:29):
Noble incentives invented by good, well-meaning executives in some office somewhere in the head office.
Chris Clearfield (29:36):
Right. Exactly. And so they pushed out targets for people: this is how many new accounts you need to open. The targets, it turned out, were horribly aggressive and horribly unrealistic. And so the management, middle management in particular, was kind of caught in between, right? They have operators telling them, I can't do this, this is unrealistic, working up from the bottom, and from the top they have this tremendous pressure. So they themselves put tremendous pressure on their tellers, on their bankers, to open all these accounts. And again, back to this work as done: those people start opening accounts, but they often do it without the customer's knowledge, without the customer's consent, and they start forging signatures. So the lines blur, and it gets criminal pretty quickly.
Douglas Squirrel (30:32):
This is a system problem, I just want to point out. Exactly, maybe criminal behavior, but there's a system problem in both the VW example and this one: somebody created these perverse incentives, this desire to do something wrong. And the not-unintended consequence was that some people did some stuff they definitely shouldn't have.
Chris Clearfield (30:51):
Right? And you know, the Wells Fargo example is particularly icky, because people were getting fired for complaining about this, for raising it. When a regional manager would come and visit, people would basically be whistleblowers, and they would get fired and they would get blacklisted from the industry, because there's kind of an industry blacklist for wrongdoing. It didn't happen to a huge number of people, but it doesn't have to happen to a huge number of people to really shut down the culture. And I don't remember the magnitude of the fine, I'm guessing in the billions of dollars, that Wells Fargo ended up paying. This is serious stuff, right? This is stuff that we really need to pay attention to. So back to the Volkswagen example, because I think the Wells Fargo example is so clean, and the Volkswagen example is really interesting. Volkswagen had these, again, incredibly aggressive targets for what they wanted their diesel cars to do in terms of performance, cost, and emissions, right? And there's a real engineering balance between these things. You can have high performance and low cost and high emissions, but if you want high performance and low emissions, then the cost has to go up a little bit. It's not a tremendous amount, but you have to put in a device with uric acid to extract some of the sulfites that come from the emissions from the diesel fuel. And owners have to change that every six months or whatever it is, and it's $500 to install on the car, and so on. So Volkswagen, wanting to basically ignore these engineering constraints, cheated on their emissions test. And they did it in this very subtle way, where they figured out if a car was on a test stand, and if it was on a test stand, they ran the engine in a different mode that had a better-looking emissions profile.
Douglas Squirrel (32:54):
And you couldn't have done that in 1960.
Chris Clearfield (32:56):
Exactly. Right
Douglas Squirrel (32:58):
Because we didn't have computers in cars that could detect this stuff. You would've had to build some crazy circuit, and it would've been obvious.
Chris Clearfield (33:02):
Right? And this is what I think is so interesting. So in the US regulations, and I think the language is at least equivalent in the European regulations, they call this a defeat device, right? It's a device designed to defeat emissions testing. And when I think about a defeat device, it sounds like something that is bolted onto the engine, right? Something that would be just screamingly obvious if you installed it on your car.
Douglas Squirrel (33:28):
Which is probably what was happening in 1960 when their regulations were written.
Chris Clearfield (33:32):
Right? Exactly. That was what the regulations were designed to avoid. But with Volkswagen, it wasn't even code, it was actually just, I mean, it was a little bit of code, but mostly it was just that they were switching out engine parameters depending on the environment. And it's fascinating, it's totally fascinating to me. So it's interesting, right? On whom was it incumbent to not do this? I mean, there's obviously engineers involved, there's non-engineers. I think Bosch, and I could be wrong, so don't quote me on that, was actually the kind of subcontractor who designed the engine parameters. So you've got all sorts of organizational complexities, and we write about this in the book, about how overbearing the chief of Volkswagen was and how that kind of top-down leadership behavior really cascades throughout the organization. And to me, that's the real lesson. The more constraints you add on your people, and one of the constraints is not creating an environment where they can speak up about what they are actually seeing, the more likely you are not only to get these really bad outcomes, but also just to get worse performance across the board, all the time.
Douglas Squirrel (34:55):
There you go. Well, this reminds me of a story from my early consulting days, which is a software example very similar to this, but we didn't have an overbearing executive. What we had was a kind of absent executive, so it went the other direction: there weren't perverse incentives. What happened was that a group kind of went off on its own and did crazy stuff that made no sense and wasted a ton of money. It was an e-commerce company that sold items, in their case books, on the internet. And their main way of selling at that time was Facebook. Facebook ads are still very effective, but they were super effective then; it was the dawn of Facebook. And so this group was really empowered. They had a really clear mission. They had marketing people and engineers together, and they said, we are going to build the greatest Facebook ad engine you've ever seen. We're going to be able to bid on keywords and people and so on, down to the penny; we're going to do it really precisely, really, really well. Well, they did it so well that they crashed Facebook servers, which you can imagine, for a startup, is a pretty impressive feat of over-engineering. Because there was no need to get that much data, to do that much trading, to do that much bidding, for what from Facebook's point of view would be a relatively small vendor. And so we got in all kinds of trouble with Facebook and wasted a ton of money on over-building. But the problem there was different from the Volkswagen or Wells Fargo examples. This group did the wrong thing, luckily not criminal, it was just really annoying to Facebook, and we had to be very nice to the Facebook account manager. But they did it not because they were given too much pressure, but because they weren't given enough. They weren't clearly being guided: you know, this far and no further, we need this much efficiency and no more.
Chris Clearfield (36:58):
Yeah. And you know, one of the things we're playing with here is how constraints are resolved, or not, how constraints are not imposed. And one of the things I often look for, as I'm working with somebody who's facing a challenge in their organization, trying to solve a problem that is kind of unsolvable given how they do things now, is: are constraints, are compromises, being resolved at the right level? That's how I want to put it. And when you go back to the work as imagined versus work as done framing of things, you want constraints to be resolved at the lowest level possible, but no lower. So if you're telling engineers you need to produce a lot of code and you need to test thoroughly, and you are not giving them the time to do that, again, we've kind of got this three-legged stool, right? If you're not giving them the time or the resources to do that, then they're going to resolve that themselves, right?
Douglas Squirrel (38:14):
Which is the wrong level to do it at.
Chris Clearfield (38:16):
Which is the wrong level to do that.
Douglas Squirrel (38:17):
It should be at that strategic level where you're putting in the constraints.
Chris Clearfield (38:20):
Exactly. So at the strategic level, if you are an organization that values testing, you need to put resources behind testing. You need to be in charge of making those trade-offs, essentially. And I'm a big believer that there are lots of problems where we get into a trade-off mindset that don't need to be solved with trade-offs. But I'm also a believer that a lot of business is about making compromises, and those compromises need to be made at a strategic leadership level for it to be really effective. So if you want people to move fast, and you want them to test their code and write very thorough code, and you want them to do it cheaply, that doesn't work, right? You've got to flex one of those things. And the best place to flex that is at the leadership level: to decide, well, here's the ROI of testing, here's how much we're willing to invest, or here's the ROI of moving fast, here's how much we're willing to invest. And then ultimately you get those constraints, those compromises, resolved at the right level. But it's a real mistake to push them down.
Douglas Squirrel (39:29):
Indeed. So let me just pull a couple of ideas together and again prompt folks to ask questions and argue with us. Very interesting. We've kind of set out the model: if you have a complex and tightly coupled system, like say microservices or a spaceship or a dam or something like that, then you're kind of headed for disaster. You should expect disasters. You might want to create some, like Netflix does, so you can have them in a controlled environment and learn from them. And what you want is to get the decisions made at the right level, with the people and the organization willing and able to bubble information up and bubble information down. The communication channels are very important. And if you don't do that kind of thing, you wind up either in criminality, like Wells Fargo and VW, or crashing Facebook servers. So actions that people who are listening to us might want to take, or might want to argue with, would be things like: can you look for complexity and tight coupling in your system? Can you reduce them? Can you mitigate them? Can you test for them? Could you create many small disasters like Netflix? And is there the psychological safety that your team needs in order to be able to tell you about problems? And that's where I wanted to suggest you might go next, Chris, because I know one of the key themes in Meltdown that I really liked was the importance of diversity. The idea that if you have just one type of person in a bubble thinking about a problem, you will get much worse solutions. And you can imagine those folks at the home office in Wells Fargo saying, yeah, more mortgages sounds great, let's get more mortgages. They didn't have the connection to the ground level to understand what was actually happening in, say, Seattle, where a branch manager knew that was going to cause a problem, right? There wasn't the information flow, there wasn't diversity in the boardroom. So could you talk a bit about that? Why is diversity valuable?
Chris Clearfield (41:16):
Well, it's valuable on a bunch of different levels, right? I mean, one, there's just a social equity component to diversity that's obviously really important. But it's also valuable in the way you sort of teed up, and both kinds of diversity matter, right? So surface-level diversity, what people look like, actually matters. We humans are group and social creatures, and so when we are in a group where everybody looks like us and everybody dresses like us and acts like us, we are more likely to give others the benefit of the doubt. There's really good social science research on that; it's very well grounded. And then we've got diversity in discipline, diversity in background, which creates the same thing, right? The effect might be a little bit different, but basically, if you're an engineer with a bunch of engineers, or if you're a business person with a bunch of business people, you tend to give others the benefit of the doubt. You tend to think, well, that doesn't make sense to me, but I'm sure they know what they're talking about. But when you take people that have different approaches, different personalities, who look different, and you put them in your leadership team, you put them in your group, what happens is they're more likely to say, hang on a sec, I really don't get this, I don't understand this. And one of the examples we've written about, and I bet a lot of people here have read Bad Blood, about Theranos, the Elizabeth Holmes company, again, a company turned fraud.
Chris Clearfield (42:59):
Medical, you know, medical testing, blood testing company. When you look at their board, I mean, it was a bunch of old white men, almost none of whom had any kind of grounding in science, right? Any kind of science background. I think there was one physician on the board, but they'd been out of practice for a long, long time. And so you take this group of prestigious people, and they are less likely to question, to challenge, to admit that they don't understand something. And I think that's all too easy a dynamic to show up all over the business world. And what you find actually is that for simple problems, homogeneous teams work faster, right? But I don't know anybody who works with simple problems in the real world; I don't think there are any simple problems left. When you get to a complex problem, homogeneous teams, I'm sorry, diverse teams take longer, but they actually get to a better outcome. And again, this is just very well grounded in lots of different kinds of social science research, from field studies to experiments to looking at data across the board.
Douglas Squirrel (44:16):
And this is a significant problem for us in engineering, because we're a bunch of white men, I'm sorry to say. And there are so many people who look like you and me doing software development. I wish I had a solution to that problem. I do everything I can to encourage people who don't look like you and me to go into software engineering, but there's a problem there that's beyond my competence to solve in terms of supply. One thing that our listeners and viewers could do, though, is to be on the demand side. You should be looking for people who don't look like everybody else, who have different experiences, who have different ways of thinking. One of my favorites is when somebody comes to me and says, you know, Squirrel, I found this great person. They're going to be a product manager, or an engineer, or a tester or something. But, you know, they studied history at university, and then they kind of got into it in this funny way because they worked in their family business, and then they went and ran a winery for a while, and then they came back to software, and, you know, should I look at this person? They look really weird. And I say, grab them with both hands. Don't let them out of the building. Totally. Somebody who comes from a different background just brings so many other ideas, so many different ways of thinking about the problem. Whereas someone who's just been writing code every day since they were six, and that was me until I was about 20, those people come with a lot of blinders on and can't bring you the kinds of creative thinking you need when you're working in a complex system that has emergent behavior and that's tightly coupled, so you have to act quickly.
Chris Clearfield (45:38):
And it takes both, right? It takes the people that are coming from that diverse background, and it takes the people that have lived and breathed software for decades. And I think the point is that sometimes the whole is worse than the sum of the parts, right? So we really want to be intentional and thoughtful about how we form our teams, how we form our groups. And then also, and this kind of ties into your psychological safety point, which I think, you know, we could talk for a whole hour plus about that.
Douglas Squirrel (46:22):
I did a live stream about it some months ago. Go look it up. Keep going.
Chris Clearfield (46:28):
Oh, that made me lose my train of thought.
Douglas Squirrel (46:30):
I'm so sorry, Chris.
Chris Clearfield (46:30):
No, no, look, I got it. I got it. You know, I think one of the things I'm often helping leaders do is pay attention to the process. In almost every organization I work with, leaders come on a path from being technically excellent. So they are good at solving technical problems, they solve bigger and bigger technical problems, and then at some point, and it's not totally true because you sometimes have a principal engineer kind of thing, but at some point, engineers, because they have been good at solving problems, switch into a leadership role and lead bigger and bigger teams. And some of them get a lot of support doing that, and some of them don't. And even for the ones that get a lot of support doing that, I think there's often, if I think about the world and divide it into content, which is the subject matter that you're working on, and process, which is how you're doing that work, how you're paying attention to it, how you're running your meetings, how you're having your conversations, are you being curious about your impact on others?
Chris Clearfield (47:33):
And the bigger your responsibility gets, the bigger your teams get, the more you need to pay attention to process. The more that becomes a really important part of your job. And process can create psychological safety; it can also destroy psychological safety. So can I just throw out a very simple process thing that I think is real low-hanging fruit for so many leaders out there?
Douglas Squirrel (47:57):
Tell us please.
Chris Clearfield (47:58):
So a lot of leaders will be in a meeting with their team and they'll say something like, here's what I think we should do, what do you all think? Right? So right away, I just want to slow that down, because right away they've weighed in with their opinion. And in many cultures, the prevalent norm is to support the leader's opinion, right? Which is a really useful behavior, right? But that can get a very different outcome than just a question like, hey, I'm not sure what to do here. It seems like we're balancing A and B, I don't know what to do, how should we think about this, what do you think? Right? So kind of turning it to the group in an open-ended way, that's a very...
Douglas Squirrel (48:36):
I would say that's building a framework for them. The A and B you described gives them a way to say, wait a minute, well, I like a lot of A, but there's some of B, and actually I've thought about C. Whereas if you say we should do A, then people try to think of reasons why A is a good idea.
Chris Clearfield (48:51):
Exactly. That's exactly it. And so that's kind of the framing; I like the framework piece of that. So that's part of process, that's part of structure. But then there's also just how you talk about it, right? So: hey, I'm considering A and B, why doesn't everybody take a minute, or five minutes, to think about these two options? And then, you know what, pair off with the person next to you, or go to breakout rooms in Zoom, and chat for 5-10 minutes about what you're seeing, what you think the advantages are, whatever. So now you've got people talking in pairs. Now, depending on the size of your team, get into a group of four, right? Again, super easy: you turn your chairs around if you're in person, or if you're on Zoom, you bring people back and put them in a new breakout room, and now you're in a group of four. So what you're doing is creating structures that make it safe for people to share and ideate and hash on a problem before they come back and say, I think these are terrible ideas, we should actually do C. And there's a power in making it a crowd process instead of an individual process. I made a video a couple of weeks ago, I can't remember exactly what we called it, but it was basically: stop asking your people to be brave, right? You can use structures to make it so bravery isn't what's required. You can make it so it's way easier for people to speak up. And that's part of psychological safety, but it's such a tangible thing that it's one of the things I am preaching all the time with the people I'm consulting with and coaching.
Douglas Squirrel (50:30):
Well, me too, for sure. So I don't see any questions in the chat. I see some good comments, which I appreciate, and hi to the folks who are saying hi. If you have questions, this would be the time to throw them in, because we're just about to finish. But Chris has given us all kinds of fantastic ideas. Let me see if I can summarize what I've heard. We have the framework of complex and tightly coupled systems, and almost all our software systems have a lot of those characteristics. So we're going to be dealing with systems that are likely to have disasters. Disasters are things you can learn a lot from. There's a whole variety of them; we've talked about a lot of different ones, each of which tells us about the sorts of ways that disasters happen. And if we were to work to understand them, even experiment with them in controlled environments, we'd learn a lot. So if you're experiencing disasters and problems, I had somebody tell me that their system had thousands of bugs logged in their bug tracker, and that tells me that something's going wrong. You could study those; you could make that an object of study and learning rather than simply a reason to wring your hands. There are opportunities for malfeasance and doing bad things. Those tend, again, not to be really operator error. There's clearly culpability there, but the system has created those opportunities. And you could imagine developers: I was doing a due diligence a few weeks ago and we found all kinds of security information in the readily available source control. So there were keys to the servers and ways to log into the database and so on that anybody could get hold of. And that wasn't because the engineers were bad, terrible people. They were taking shortcuts because of the incentives that they were getting. So that's what we told the client to do: fix that, rather than go fire the people. And the final one is that diversity is really valuable, because if you can get more people thinking about a complex problem, they're going to bring a lot of different perspectives. And over and over again, both Chris and I have seen the value of people with very unusual backgrounds, very different ways of looking at the system, working together and coming up with, hey, wait a minute, we don't need A, B, or C, we need Zed, we need Q, we need teddy bear, right? We need something really outside the box. And you get a lot of wonderful ideas that way. So if you're not creating opportunities for your engineers to do that, do that. I will put in a plug for the session I'm doing next week on getting your engineers to talk to customers, because if you get those two groups talking, talk about diversity, you can hardly be more diverse than the people who use your software and the people who write it. And when you get them talking together, you get really wonderful results. We'll talk more about how to do that. But Chris, I want to let you talk about how people can get in touch with you. They can certainly read Meltdown, that's one of the first things to do. But if they want to hear more about you, they want to watch the video you just talked about, where's the best place to start to find more of Chris?
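As an aside on the source-control story above: one concrete way to avoid keys sitting in a repository anybody can clone is to read credentials from the environment (or a secrets manager) at startup and fail fast when they're missing. A minimal sketch, with hypothetical variable names, not a description of the system in the anecdote:

```python
import os
import sys

def require_env(name: str) -> str:
    value = os.environ.get(name)
    if not value:
        # Fail loudly at startup rather than falling back to a key checked
        # into the repository.
        sys.exit(f"Missing required environment variable: {name}")
    return value

# Hypothetical names; the real values live in the deployment environment,
# so a clone of the repository never contains a working credential.
DATABASE_URL = require_env("DATABASE_URL")
PAYMENT_API_KEY = require_env("PAYMENT_API_KEY")
```

Making the secure path the easy path is the system-level fix the summary above points at: the shortcut disappears when the incentives and tooling no longer reward it.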
Chris Clearfield (53:17):
Yeah, there are sort of two places. The best place, and the place where I put everything first, is my mailing list. So if you go to clearfieldgroup.com you can download a guide called Three Mistakes Leaders Make When They're Leading Change, which reflects a lot of what I focus on: helping people create change. At clearfieldgroup.com you can book a call with me and download that guide, and once you've joined my mailing list you're in the fold and you get everything: the videos, the newsletter that I publish. I also tend to be pretty active on LinkedIn and try to join conversations there; I post a lot there. So that's it.
Douglas Squirrel (54:02):
Is your LinkedIn on clearfieldgroup.com?
Chris Clearfield (54:04):
Gosh, I don't know if it is, but I'll make a note of that. If you search for Chris Clearfield on LinkedIn, you should find me.
Douglas Squirrel (54:13):
There aren't a lot of Chris Clearfields and certainly not as many who are obsessed with disasters.
Chris Clearfield (54:18):
Right? Exactly, exactly.
Douglas Squirrel (54:20):
You should be able to find the right one pretty quickly. Well, that's good. So I hope folks who are listening and watching will do that and get in touch with Chris. If you're interested in my stuff, the best place to learn more about why we're doing this and how to do more is squirrelsquadron.com. Head on over there to sign up for more events like this, be on the forum, and talk to me; we have a whole series of very interesting things coming up, and Chris is just one in a series of fantastic folks. I've really enjoyed speaking to you, Chris. Thanks so much for coming.
Chris Clearfield (54:52):
Yeah, thanks Squirrel. You're welcome; I've appreciated the opportunity. And I see Steven has a question here in the chat.
Douglas Squirrel (54:58):
We've got a last-minute question. We won't redo all the wrap-up, but let's see if we can squeeze in an answer for Steven. It's a little long to put on the screen, so I'll read it out: do you agree with encouraging data-driven decision making? Often things go wrong based on human-made assumptions. When I think about the example of the climbers passed by on Everest because people assumed they were dead, that's a classic. So Chris, I wonder what you think about that.
Chris Clearfield (55:25):
So I think I have mixed feelings about data-driven decision making, which might be a little taboo. What people often don't realize is that the data they're collecting is shaped by the system they've built and the system they're in. And so it can be really dangerous, right? Because it can lead you to the wrong conclusions. Data can also be a real vehicle for expressing resistance. I was working with a team, I can't go into too much detail, but they basically worked on big industrial sites and were creating change on those sites, and they had great data showing that the way things were being done was wrong and needed to change. But they would go to leaders and show them the data, and the leaders would say, well, what if the wind were from the south-southwest on a Tuesday?
Chris Clearfield (56:16):
Could you run all the Tuesdays with that kind of wind? And they would go back, it would take them a week, they'd run it, and they'd say, it still says the same thing. And that happened over and over. So data can be a real source of resistance. It's also already folded into our system, so the most powerful thing you can do is change the context and change your system. Data is maybe the start of that, but it's not the whole story. So that's the first thing I'll say.
Douglas Squirrel (56:46):
I would agree with you, Chris. Go ahead.
Chris Clearfield (56:48):
I'll keep going. The other thing I'll say is, Steven, I love what you said about assumptions. I think, I mean, Squirrel, a lot of your book is about testing assumptions: about what our own mental models are, about what our team's mental models are, and that's where I think it's really, really powerful. But I don't think that's usually what people mean when they talk about data-driven decision making. We can test assumptions, and in fact I'm doing this on a project right now. I'm helping a leader of a big law firm restructure their leadership team, and we have all these ideas about what is good about the current structure and what isn't. Part of what we're doing is talking with people and testing those ideas, testing those assumptions. That to me is really powerful, because then we get to see the organization as it is rather than the organization as we wish it were. And seeing things as they are, even if that's all you do, is a tremendously powerful thing. That's not what most people mean when they talk about data, but I think it's a really powerful way to find out what is actually happening here.
Douglas Squirrel (57:59):
Indeed. And I'll say two things about that. One is something I'm stealing from my co-author, Jeffrey: even your eyes don't actually see what's out there. They're busy making saccades, moving around and doing things you're not aware of. It's an illusion that you're seeing the two of us on your screen; those photons are not necessarily hitting your eyes the way you think they are. So actually seeing things the way you think they are is a much more slippery business than you realize. The other thing I'll say is that one of the problems with data-driven decision making is: which data? I remember being at an e-commerce company, and I've worked with many clients since then in the same situation, where somebody would bring some data and say, this is what we should do, and somebody else would pull out another system and another piece of data and say, well, this doesn't match what you're saying. So you can have data wars, if that's how you choose to fight your battles. But the problem isn't just that you're having the battles; it's that you're not doing enough experiments to learn what's actually happening. That's what we wound up doing, and we discovered all kinds of things that didn't match any of our data. So I hope that's helpful, Steven. Steven says that was a great discussion, so I hope we've answered his question. Chris, we're over time, but it's been an absolute blast. Thank you so much for being here, and I look forward to many more fun chats in the future.
Chris Clearfield (59:15):
Yeah, thank you Squirrel. Lovely to reconnect, and it's nice to see you again after all this time.
Douglas Squirrel (59:20):
Thanks so much. Thanks, everybody. We'll see you again next week and at more events in the future. Take care. Bye now.