Definitely, Maybe Agile

How business agility fails if we don't have resiliency

Peter Maddison and Dave Sharrock Season 1 Episode 53

 Learning from failure is an essential part of building your resilience. In this week's episode, Peter and Dave discuss how failure gives us the ability to improve and continually create more resilient systems.

This week's takeaways:

  • The importance of understanding the resiliency of each part of our system in the context of the whole system.
  • Failure allows us to learn, and then we can rebuild stronger.
  • Recognize that we need to persevere through failures to gain the benefits.


We love to hear feedback! If you have questions, would like to propose a topic, or even join us for a conversation, contact us here: feedback@definitelymaybeagile.com 

Peter

Welcome to Definitely Maybe Agile, a podcast where Peter Maddison and Dave Sharrock discuss the complexities of adopting new ways of working at scale. Hello and welcome to another exciting episode of Definitely Maybe Agile with your hosts Peter Maddison and David Sharrock.

Dave

I was just gonna say, Peter, I can tell this is a topic that you are interested in, passionate about, in fact, right? So what is it we're talking about today?

Peter

We're talking about resiliency and redundancy and all these wonderful good words starting with the letter R.

Dave

And specifically, I think we had a podcast a few weeks back when we talked about the value of resilience and building resilience into the system. So our focus today is not so much on the value created by it as on why we lose money when things go awry.

Peter

Yeah, and how business agility fails if we don't have resiliency. And there are lots of examples of this. Some common ones include things like: you put a bunch of automation into place, you start to remove the toil from the environment, and then the first time that something goes wrong and the automation is blamed as the cause, everybody wants to rip it all out and go back to the bad old days of doing everything manually. And this causes a lot of slowdown in the system. So if you don't bring along this understanding that failure is something that we learn from, that failure is what gives us the ability to continually improve, then quite often a lot of the really good stuff you've done to start to build out your resilient systems starts to in fact cause you pain instead.

Dave

Well, what you're describing there is actually part of the process if you're building resilience into your system, and the thought that sprung to mind as you were describing that is test automation. If you do test automation for the first time, I'll often talk to teams to say, expect to throw all the tests you write in the next three to six months away. Because you won't write the right ones. You'll learn as you go, you'll realize how to structure your test suite, your code base, and so on, so that you can get the feedback more quickly and more easily. But you don't know that right at the outset. And yet there are so many teams that we'll talk to who say, yeah, we tried test automation, it didn't work for us because we had this release and we ended up manually testing. So they go back to the old way of doing things, thinking that the automation they were trying to do failed, when actually it didn't fail, it highlighted areas that needed more attention.

Peter

Yes, indeed. And it's that piece where people are often scared of the rework that comes with that process. It's the sunk cost of having started down this path: am I willing to actually tear down what I've already built and learn? And realizing that, yeah, all of these pieces that I've put together, they've served the purpose of me learning what not to do. Now I can learn what to do and what happens next. How do I build on that? I have to be willing to admit that something's gone wrong to be able to learn something.

Dave

I love that sort of fallacy, and I bump into this all the time, that rework is somehow bad and it's uncommon, when actually rework is the work, and it's not just common, it's everywhere, it's what you do all the time. When you finish doing something, you realize there's a better way of doing it. The question is more: do we go back in and rework that, or do we save that for another day when it's a priority again and go back and apply what we've learned? But that's pretty much the norm, not an exception.

Peter

Yep, and it's important, especially when we're starting to build out resilient systems, because we want to learn. We want to learn that the thing we built, that architecture we built, those pieces we put into place, isn't going to serve us. We need to work out what we can do better next time. We want to learn from what's occurring in our production environments, and we want to learn how we can apply that to get better at building a more resilient system, with certain caveats around that. I mean things like: how much resilience do we need to build in? Because we could go to the moon and back and build something that's as resilient as we can possibly make it, but there's a cost to all of this, both in time and money.

Dave

Well, I find it interesting. So perhaps we should just take a moment to define what we mean by resilience, because it's a little bit like agility in the sense that there's a tendency to think it's all or nothing. We're either agile or we're not. When the reality is there'll be different levels of application of agile principles, if you like. There are parts of our organization that should be all the way over, and agility should be in everything that they do. But there are other parts of the organization where, for whatever reason, it just might not be needed. It's just not the priority, or there may even be other ways of working which are better. So if we start with resilience, what's your definition of resilience?

Peter

So a definition of resilience is a system that can recover from failure, a system that's capable of responding and continuing to respond, even as parts of that system fail or go away or don't exist, and even if it's responding in a degraded manner. Now we need to, of course, understand how degraded is allowable. That gets us into a deeper level of conversation around what that means. But that's kind of how I think of resilience.
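
[Editor's sketch, not part of the conversation: the idea of a system that keeps responding "in a degraded manner" when a part fails can be illustrated with a simple fallback. This is a minimal sketch under stated assumptions; the names `respond`, `primary`, and `fallback` are hypothetical and do not come from the episode.]

```python
from typing import Callable, Tuple


def respond(primary: Callable[[], str], fallback: Callable[[], str]) -> Tuple[str, bool]:
    """Return (response, degraded).

    Try the primary path first. If it raises, answer from the fallback
    path instead and flag the response as degraded, so the caller can
    decide whether that level of degradation is allowable.
    """
    try:
        return primary(), False
    except Exception:
        # Hypothetical degraded path, e.g. a cached or simplified answer.
        return fallback(), True


def live_lookup() -> str:
    # Stands in for a dependency that is currently unavailable.
    raise RuntimeError("dependency unavailable")


def cached_lookup() -> str:
    return "cached answer"


response, degraded = respond(live_lookup, cached_lookup)
```

[The `degraded` flag is the point of the sketch: the system still answers, and the question of how much degradation is acceptable stays visible rather than hidden.]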

Dave

And I think I'd clarify one word there. When you're talking about systems, it feels to me like you're talking about technical systems like software and hardware and bits and pieces that we're working on, but you're actually also talking about organizations, about non-technical systems, systems in the sense of systems thinking, not in the sense of a technology system.

Peter

Yes, yes, very much so. I mean, it's all the different pieces that make up the delivery of a service.

Dave

And I find when you look at the organization or the system itself, one of the things to recognize is there are systems that are designed to break as part of their resilience. There are systems which are not designed to break but are designed to continue operating. And so there are a lot of different nuances in that sort of resilience. And when I was describing that, in my head I'm thinking of things like the collarbone. The collarbone is a fail-safe that is designed to break under stress in particular situations to protect the rest of the body. So there's the resilience that is a degradation of the system in the short term to protect the system as a whole. There's also that resilience of, you know, I should be able to bounce back from whatever is happening. That system should not be knocked out of commission by any sort of major jolt.

Peter

Yeah, and muscles work like this too, the whole anti-fragility concept where we break it to make it stronger, we stress it to make it stronger. And this is where you can look at things like chaos engineering as an example of this in technical systems, where we purposefully stress the system, we purposefully break it, so we can learn where the breaking points are, so that we can make it stronger and build back stronger as a consequence.

Dave

Or, I think, learn. We're not saying we go around breaking bones to make them stronger, but when we stress the system, the system learns where the stresses come from, and if it's able to learn and go in there and re-jig things, it then comes out the other end stronger. It actually reminds me of an interesting thing when we talk about investment and we've got businesses and organizations. I read a great little, I mean, I read the title and probably the first paragraph of an article about Amazon and how one of the things that they've done in the last couple of years is invest in their own cargo ships. Now they're smaller than the big cargo ships that we're used to seeing, effectively stuck in the Suez Canal and other places like that, and queuing up outside the major ports around the world. But by having their own cargo ships, which are slightly smaller, all of a sudden Amazon is able to bypass some of the bottlenecks that we're all seeing in terms of the supply chain and go to smaller ports and have control over that. And that's this great example of a system where they understand where the risks are and they've invested to build resilience, not into the entire value chain or supply chain, but into parts of it, to give them a little bit more flexibility or more resilience in their business model.

Peter

Yeah, and you see that in all of these places. I mean, just-in-time got a lot of flak during the last few years, but you see this in what Toyota did with their manufacturing. It takes a hundred thousand parts to make a car. They went through and understood and worked out what are the 500 that they always know they're gonna need or that might run out, and they made sure they always had 10,000 of those on hand at any one point, so that when they got hit by supply chain problems, they were less impacted than other organizations. It was much easier for them to bounce back because they had a much deeper understanding of their system, because they'd taken the time to learn it.

Dave

So, what I'm hearing as we're describing this, one thing that strikes me is this sort of first step: when building resilient systems, you start by understanding what that system is, how it works, not just what it does, but also what the constituent elements within that system are and where the risk lies, where you need resilience versus where you can live with an element of fragility or an element of no risk.

Peter

Interestingly enough, if you look at the bottom layer, there was a gentleman called Dixon who created a pyramid, and I think it was a telco he was working with, for the adoption of SRE practices. And the bottom layer of the pyramid is monitoring. It's exactly that. Identify how your system works, identify all of the pieces of it. So that's your base layer. We need to know what's going on, we need to make it visible, as we so often have talked about.

Dave

So monitoring, understanding the system, is that first step of just being able to picture where things are, and I think what we often see is that the goal is to have a resilient system without really appreciating what parts of the system should be the focus of attention. I think this draws us to that whole point that we're not saying that every element of that system has to be resilient. There will be different levels of resiliency required depending on our dependency or the risk associated with each of those. So the second part is clearly to allocate some sort of risk to what we see as we understand our system in more detail. And then once we understand those risks or the level of resiliency that is required, we can start building that resiliency in.
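
[Editor's sketch, not part of the conversation: the steps described here (map the parts of the system, allocate a risk to each, then invest where the risk is highest) could be outlined as below. Scoring risk as likelihood times impact is an assumption made for illustration, as are the component names, the numbers, and the `prioritize` helper; the hosts do not prescribe a formula.]

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Component:
    name: str
    failure_likelihood: float  # assumed estimate between 0 and 1
    impact: float              # assumed estimate between 0 and 1

    @property
    def risk(self) -> float:
        # Assumption for illustration: risk = likelihood x impact.
        return self.failure_likelihood * self.impact


def prioritize(components: Iterable[Component], threshold: float) -> List[Component]:
    """Return the components whose risk meets the threshold, highest risk first."""
    risky = [c for c in components if c.risk >= threshold]
    return sorted(risky, key=lambda c: c.risk, reverse=True)


# Hypothetical inventory of system parts with made-up estimates.
inventory = [
    Component("payments", failure_likelihood=0.2, impact=1.0),
    Component("reporting", failure_likelihood=0.5, impact=0.1),
    Component("checkout", failure_likelihood=0.4, impact=0.9),
]

focus = [c.name for c in prioritize(inventory, threshold=0.1)]
```

[Parts that fall below the threshold are the ones the team consciously chooses to live with as they are, which matches the point that not every element has to be equally resilient.]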

Peter

Exactly, exactly. And then that gives us a roadmap to start to look at what is the resiliency of our system, how do we improve it, how do we introduce elements that are going to help, and where do we need to invest? And that's how we start to build on that model. So I think with all of that, how would we like to sum this up for our listeners today?

Dave

I really liked where we started at the outset, which is understanding resiliency as a goal for a part of our system, not the whole system. And I think we talk about resilient systems and give this impression that the entire system is there, when actually there's a granularity, there's a level of detail and understanding we need so that we can invest the right amount in the right place. So I think that was something that really came across as we were having that conversation, this idea that while we talk of resilient systems, there are parts of the system that we're going to invest a lot more time and energy on. And there are parts of the system where we'll live with the state of the system as we find it.

Peter

And I think I agree. And I think one of the pieces I would add to that is the learning from the failure, that your system will fail. And in fact you want it to fail. Failure is good, failure allows us to learn, and then we can rebuild stronger. So this idea that it's not a bad thing when things go wrong, and that it's a good thing because it's creating a learning opportunity. And having that in our approach is a good way to start to look at things.

Dave

Well, and I think this is exactly right. It's also the recognition that those first early failures, which might feel in some cases like a sign that we should turn back, are actually a sign that we're going in the right direction. So there's a little bit of nuance in there, but we need to persevere through that to gain the benefits. We're not going to avoid those sorts of steps that we take where we miss something, because by definition, we don't necessarily know where to look and we're going to miss them.

Peter

Exactly. So with that, I think we can wrap up for today. I really enjoyed the conversation as always. If you want to reach out, you can find us at feedback@definitelymaybeagile.com. And thank you, Dave.

Dave

Perfect, great conversation again, Peter. Thanks again.

Peter

You've been listening to Definitely Maybe Agile, the podcast where your hosts Peter Maddison and David Sharrock focus on the art and science of digital, agile, and DevOps at scale.