Definitely, Maybe Agile

Managing SPOF (Single Point of Failure)

Peter Maddison and Dave Sharrock Season 1 Episode 42

 Peter and Dave talk about the importance of managing SPOF (Single Point Of Failure) this week. This issue can have a negative effect on an organization if not handled correctly, so stay tuned! 

 This week's takeaways: 

  • A single point of failure in an organization does not mean that organization has failed. 
  • Identify them. 
  • Build redundancy or slack into the system. 
  • Encourage the habit of continuously sharing knowledge. 


   References in this episode: 

The Phoenix Project by Gene Kim, Kevin Behr, George Spafford - https://www.goodreads.com/book/show/17255186-the-phoenix-project 
 
  We love to hear feedback! If you have questions, would like to propose a topic, or even join us for a conversation, contact us here: feedback@definitelymaybeagile.com 

Peter

Welcome to Definitely Maybe Agile, the podcast where Peter Maddison and Dave Sharrock discuss the complexities of adopting new ways of working at scale. Hello, and welcome to another exciting episode of Definitely Maybe Agile with your hosts, Dave Sharrock and Peter Maddison. So what's on your mind today, Dave?

Dave

Peter, great to catch up with you. What's on my mind? How do I put this? Have you ever worked in an organization where there is one person who seems to have the keys to the kingdom? Every single project, every change to the system depends on that one person's availability.

Peter

I would say I've worked in a few organizations like that, yes. Like all of them, in fact.

Dave

Yes, well, of course. I think that's something we're very, very familiar with. But what impact does that have on an organization?

Peter

So generally, what we find, of course, is that if there's a single point that everything has to funnel through, then things get delayed, they pile up, and they wait for that person. It also means that if that person, I don't know, goes on vacation, if they're ever allowed to, then everything just grinds to a halt. Nothing gets done, or we're unable to move forward. And if all of the knowledge resides in just that one place, then we run into problems with how we make decisions, because it becomes almost impossible to make decisions when you've got just one place that knows how to do things and how to move things forward.

Dave

Well, it's an interesting one. What I find really interesting is that you can almost see these individuals: they're very, very talented, they're very knowledgeable, they're enthusiastic and supportive, and they do anything and everything they can to move the organization forward. So in their role, this is all positive. But what you then often end up doing is saying, oh, hold on, this is the person who knows about this. And so that resource management piece, deciding who is the most appropriate person to bring to the table for this conversation, unfortunately has the unintended consequence of building specialization, of creating individuals who become the breaking point. It's not that the individual is consciously doing that, but they become a bottleneck, which means you can't move without having that individual involved in a conversation or making changes to the code base, whatever it might be, because they're the one who has all of that knowledge. And even worse, they can't write it down. Yeah, exactly. No, no, exactly, because they don't have time.

Peter

Yeah, yeah, because they're so busy in all those meetings, because they need to be in all the meetings, and they've got to be the one who's there for everything. So yes, totally. And it always, or nearly always, comes out of a lot of good intention. This person has tried to solve problems and they've been successful at solving problems, and as a consequence they've become very, very knowledgeable, and they've become the place where we go, because we know that if we go there we're going to get an answer that allows us to move forward. So everybody goes there, and it starts to broaden. If you look at The Phoenix Project, which is a classic book, there's a character in there called Brent that everyone refers to, and Brent is the person that everything in the organization revolves around. If anything's going to happen, whether it's in the development area or in the infrastructure area, it has to go through Brent because he's the one who knows. And if something goes bump in the middle of the night, it's Brent they call and get on the line, because he's the person who knows what needs to happen to fix it.

Dave

Yeah, this is so true. And I think there are a couple of things that bounce around when I start talking to an organization and there's very clearly a Brent in the organization, which many organizations have. The follow-through is quite interesting. The first thing is that sometimes you get almost a denial, you know: you don't understand, Brent is just, and I'm going to use Brent as the moniker here, Brent is the individual we need to bring to the table, and it's just the way things are. And I think that first point of recognition is that those individuals, single points of failure as they're often referred to, will occur. They're going to occur just as you get natural movement, attrition, promotions, or people shifting around on your teams. Occasionally you're going to be left with one individual who holds a lot of information. So it's not a bad thing that they happen once in a while. It's just nature's way: things move backwards and forwards, and occasionally that will happen.

Peter

It's then, what do you do about it afterwards? It's recognizing that it's there and that it exists, and then taking action, saying, okay, so you seem to be the person all of this is starting to revolve around. How do we start to distribute that knowledge? How do we start to make it more available and accessible to others, so others can take these tasks on, so others can do the things that you know how to do so well, and so that we can bring the other people up, so that it's not just you who has to be there for everything that needs to happen?

Dave

And I think this is where it starts getting a bit sticky, because really this becomes a leadership or management responsibility: identifying that it's a strategic risk in your organization. Either it's a strategic risk because it's there, or it's a potential risk because it can begin emerging as a problem. And the headache we often have is that efficiency is the driver. All of us are working in environments, in organizations, where there is way, way more work than time and capacity to deliver it. So, in a sense, there's always that push to do more, that efficiency drive. And the catch is that there really is only one way to resolve this, and that is to build redundancy, to create slack in the system: two people who know the same as one person, which on the surface is immediately in conflict with that focus on efficiency and on getting as much done as fast as possible. Yeah.

Peter

I think you're right there. This is one of those things that is driven by the desire to have everybody working at 110% to get all of these different things done, rather than understanding that it's necessary to have that slack in the system so that we've got time for learning, time to share ideas, and time to ensure that knowledge is properly distributed. We forget to put the effort into making that happen, or even that it is effort to do it. I was talking to somebody earlier today about this, where they were describing some similar problems in their organization around, well, how do you get that knowledge distributed? So I listed out lots of ways you can do that, and then they were saying, okay, so if we have all of these different ways of distributing knowledge, how do we make time to ensure that it happens? How do we make time to ensure that we communicate correctly, that when we're making changes in the organization we're properly distributing knowledge, that we don't end up with one person who knows everything about what needs to happen, and that we're all coming to some form of alignment, a level where we all understand what needs to happen, so there isn't just that single point where everything revolves around what that person knows?

Dave

Yeah, it's a difficult one, because the reality is you need to have a conversation with some stakeholders to say things have got to slow down. We can't keep at this pace. And I think that's why it matters to recognize that it can happen and that there are periods of time when it will. You and I were just chatting about this: we're running up to a very, very busy time of year for people in our profession. We're overloaded in many ways, right? We have lots and lots of things going on, but it's for a short period of time, and we know we're going to hit a period when the workload drops and we can adjust accordingly. So there's that element. There is often a case of, you know, hold on to the reins for a few sprints or a month or two, because this is just what has to happen here, and I think that's the reality in many situations. The headache comes when that's the norm, not the exception. Now what you're actually doing is effectively running the engine hot, and something will break, and when it breaks, it's pretty nasty. The engine doesn't just run slow; it stops.

Peter

Yeah, totally. And I've seen this in organizations, as I'm sure you have, where people will eventually either burn out, get sick, or just leave, at which point all of that knowledge walks out the door, and all of a sudden you're in a far, far worse situation than you were before. So, taking this back to one of the things you said earlier, there is this role for leadership to identify this, because it can be hard for the person who is in that position to realize that that's what's happening. They can just feel like they're getting very, very busy. They've got a lot on their plate, and they can enjoy it as well. I mean, it is a challenge, and it's nice to feel wanted. We all want to feel wanted, and so it can be hard for them to recognize that there is a need to distribute that, to distribute the functions more, because it's not good for them either. It can be bad for you from a health perspective, and from just a general stress perspective, if you're the person who always has to be there.

Dave

I think that point is really very valuable, because I can't think of a single situation where anybody in that role has been self-aware of it without it being brought to their attention. I've certainly bumped into people who know that they are in that role and are aware of it, but it's not something that happens on your own. It's that whole boiling the frog problem, right? The temperature increases degree by degree and you don't actually move because, well, it feels warm, and it feels good to be an important cog, or a critical cog, in the machine as it's getting built. But when you realize you're the only cog that is always there, whether it's 4 a.m. or a Saturday or a Sunday or whatever it might be, then it's almost too late. So that falls back to the managers or the leaders of that organization. And that's where we've got to get the balance right. It's almost being a gatekeeper: working with the stakeholders, the many people trying to get things done, to manage expectations, and not just being a flow-through of "this work needs to be done", but really throttling back so that you're able to build the resilience in the team and the redundancy in knowledge and expertise that is necessary to function beyond just what you're delivering at the moment, to function safely.

Peter

Yes, exactly. And there are pieces and practices we can put in place. We were talking a little bit before we started recording about how you build resilient systems: short increments, fast feedback loops, and actually putting the practices in place to make sure we are looking at what happened and what went wrong, learning from it, building the learning back in, and making sure that action occurs based on that learning. There are lots of great practices out of the ITSM space and SRE around this. Repair debt is an example. This is the idea that we actively track the action that needs to occur in order to resolve the root cause, and then we track those metrics over time. We ensure that we're paying that debt down, that we're taking action to improve the resiliency of our systems, because this in turn will help remove those single points of failure too, by making the system resilient. We fall to the level of our systems.
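
Editor's note: the repair debt idea Peter describes, tracking the actions needed to resolve root causes and watching the count over time, can be sketched in a few lines of Python. This is only an illustration under assumptions: the `RepairItem` model, the function names, and the sample items and dates are all invented here and do not come from the episode or from any particular ITSM tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class RepairItem:
    """A follow-up action identified while resolving a root cause (hypothetical model)."""
    title: str
    opened: date
    closed: Optional[date] = None  # None means the action is still outstanding


def open_repair_debt(items, as_of):
    """Count repair actions opened on or before `as_of` and not yet closed by then."""
    return sum(
        1
        for item in items
        if item.opened <= as_of and (item.closed is None or item.closed > as_of)
    )


def debt_trend(items, checkpoints):
    """Return (checkpoint, open count) pairs so the trend can be reviewed over time."""
    return [(day, open_repair_debt(items, day)) for day in checkpoints]


# Illustrative data only; the items and dates are invented for this sketch.
items = [
    RepairItem("Document failover runbook", date(2024, 1, 5), date(2024, 1, 20)),
    RepairItem("Add alert for disk usage", date(2024, 1, 10)),
    RepairItem("Pair on deploy script", date(2024, 2, 1), date(2024, 2, 15)),
]
trend = debt_trend(items, [date(2024, 1, 15), date(2024, 2, 10), date(2024, 3, 1)])
for day, count in trend:
    print(day.isoformat(), count)
```

A count that stays flat or rises across checkpoints would suggest the debt is not being paid down; what threshold should trigger action is a judgment call for each team.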

Dave

Well, what I find interesting here is that what you're describing is, you know, small continuous improvement. That's what everybody talks about. As you're talking about that, I have a picture in my mind of brushing your teeth versus going to the dentist. There's a huge difference between brushing my teeth every day and not brushing them but going to the dentist every three, six, or twelve months, whatever it is, to get everything fixed. And the habit that has to be built up to brush your teeth, or, as we're all learning, there's more to mouth hygiene than that, so flossing and all of the things that go with it: these are habits that prevent the big dentist bills and everything else. I think it's really important in organizations to recognize that there is a need for a habit of continuously correcting, repairing debt, continuously revising and tweaking and changing as a result, because it's not something we can recover from by doing everything every six months when something goes wrong. Going to the dentist every six months or a year is not the same as keeping up these habits continually, so that the interventions that are required are significantly less frequent and significantly less severe.

Peter

Yeah, actively building it into your systems and your practices, understanding that it takes time and it takes work. We need time to be able to reflect, time to look at what we're doing and say, okay, how can we do this better next time, and how can we build that in? If we're trying to make sure that we get every single new feature and every piece built in, and everybody's working at 110% to make that happen, then that time isn't there. Without that slack, things will rapidly start to degrade and fail over time, and you'll end up with these single points in the system as a consequence. You very often see that.

Dave

Well, if we take the fact that we want this to be a habit that's happening continually, that's things like knowledge sharing, and taking a break to build a bit of redundancy, or making sure multiple people are aware of how things have been designed or what the risks are, or whatever it might be. But one of the things to add on to that is making that learning visible. Too many times I've seen teams where that learning, the "hey Peter, why don't you and I work together so I can learn from you the things which only you know, so that now there are two people who know this", that sort of work is hidden. It has to be done behind closed doors in some ways. And I think it should be very visible. I've seen teams expand their estimates to add in a couple of extra points for learning, where we take a piece of work and say, hey, this is a great opportunity for us to share that knowledge, let's make it a bigger estimate so we can share it. I've also seen teams, and I love this one, where the individual who is the single point of failure is prevented from actually changing the code. They're only allowed to go in when somebody else is there. They can oversee it; they can say, don't do that, it's going to break horribly, or they can say, yep, you're in the right area, go with that. But now you're ending up with two people beginning to gain confidence instead of one. And bearing in mind how helpful these individuals are, actually making them sit on their hands is a great way of making sure the other individuals are learning and not just watching. Because we've all seen it: we watch masters at work, we nod our heads and go, how hard can that be? And of course it's not easy. That's not the way it works, right?

Peter

I think we've covered a lot of really, really great material in this conversation. I really enjoyed this. If we were to start to sum this up, where would you go? What three points would you like to sum this up with?

Dave

Let me pick one of them: that first observation we started with, which is, hey, it happens. Just because there's a single point of failure in an organization does not mean that organization has failed. But if that continues to be the case, that's an indication of failure. It's something we're going to see occasionally, but we need to actively manage it and actively encourage those individuals, so that they remove themselves as a single point of failure.

Peter

I think there's another piece building on that too, which is that they will happen, so look for them. That's a great point.

Dave

Yes, yes.

Peter

So if you're in a leadership position, we know they will happen. So be aware, look for them, notice, be aware of what's happening.
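
Editor's note: one simple way to "look for them" is to write down who knows each area and flag the areas with only one name against them. The Python sketch below assumes a hand-maintained map of areas to people; the function name `single_points_of_failure`, the area names, and the people (other than the Brent moniker used in the conversation) are hypothetical and purely illustrative.

```python
from collections import defaultdict


def single_points_of_failure(knowledge_map):
    """Map each person to the areas where they are the only one who knows the work."""
    sole_owners = defaultdict(list)
    for area, people in knowledge_map.items():
        if len(people) == 1:
            (person,) = people  # unpack the single name in the set
            sole_owners[person].append(area)
    return dict(sole_owners)


# Illustrative data only; a real map would come from the team itself.
knowledge_map = {
    "network configuration": {"Brent"},
    "deployment pipeline": {"Brent", "Asha"},
    "billing database": {"Brent"},
    "mobile app": {"Asha", "Lee"},
}
spofs = single_points_of_failure(knowledge_map)
for person, areas in spofs.items():
    print(person, "is the only person covering:", ", ".join(areas))
```

The map itself is the hard part, and it is only as accurate as the conversations behind it; the code merely makes the gaps visible.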

Dave

I love that. Almost to the point of, if you don't have one, either it's you or you're not looking hard enough, which is a great point. Yeah. I think another piece is that you need to build redundancy or slack into the system. However you look at it, the constant pursuit of efficiency and of maximizing output is not going to create the space that you need. So there has to be some control on that, some recognition that there is redundancy to be built. It's not bad that two or three or four people know the configuration settings in your network. It's actually a good thing. You've got flexibility there; you've got resilience being built in. And maybe the third one is the need for that habit of continuously sharing knowledge. We talked about that in two ways. One is the habit itself: you introduced the concept of repair debt, right, and things like that from the IT service management space. The other thing we added on to that was the idea of making it visible, so it's not hidden under the table but is actually something that is recognized and encouraged and rewarded in some ways. It's considered professional delivery rather than something you've got to do behind closed doors.

Peter

Yeah, that knowledge is shared, that we make time for that sharing, and that we put the effort into it with the understanding that it takes work to do this. If organizations are looking to encourage new practices and change, I often encourage them to get somebody who is going to be solely focused on communication as a part of that, because it's such a critical part of what needs to happen. Thinking of it as a side-of-the-desk activity will often result in poorer results than you would like to see from what you're trying to do. Well, with that, I think we're out of time for today. As always, it's been a fantastic conversation, and I'd like to thank you for that. If people would like to reach out, they can at definitelymaybeagile.com. And thank you.

Dave

Looking forward to our next conversation. Thanks again, Peter.

Peter

You've been listening to Definitely Maybe Agile, the podcast where your hosts, Peter Maddison and Dave Sharrock, focus on the art and science of digital, agile, and DevOps at scale.