
Definitely, Maybe Agile
The Hidden Cost of Temporary Fixes
Every technical system harbors its share of quick fixes and band-aids – those temporary solutions we implement with the best intentions of returning to fix properly "someday." But what happens when that day never comes?
Peter Maddison and David Sharrock dive deep into what they call "longstanding risks" – the accumulated technical debt that results from prioritizing expediency over completeness. Through a relatable example of a memory-leaking service that gets automatically restarted rather than properly fixed, they unpack the hidden costs of these decisions. The conversation reveals how seemingly minor shortcuts can gradually transform robust systems into fragile, unmaintainable messes.
The hosts share a compelling analogy about a utility company that saved money by skipping tree trimming around power lines for just one year – only to face significantly higher costs from the resulting infrastructure damage. This perfectly illustrates how short-term thinking about technical maintenance creates expensive long-term consequences. They offer practical recommendations including proper documentation of temporary fixes, avoiding team overload, and maintaining good system hygiene.
What makes this episode particularly valuable is the mindset shift it advocates: moving from attempting to prevent all possible failures to building systems that remain resilient when inevitable problems occur. As Maddison references from safety expert Sidney Dekker's work, sometimes the best approach is focusing on what makes your system work well rather than obsessively trying to eliminate every risk. Whether you're managing complex technical systems or leading transformation efforts, these insights will help you balance pragmatic solutions with long-term system health.
Peter: Welcome to Definitely Maybe Agile, the podcast where Peter Maddison and David Sharrock discuss the complexities of adopting new ways of working at scale. Hello, Dave, how are you today?
Dave: Peter, great to catch up with you. What are we chatting about today?
Peter: We're going to talk about longstanding risks, at least that's what I'm going to call it for now; we may decide to call it something else as we work through this. But it came from a conversation I've been having recently: when you identify a risk and do work in the system to mitigate it, sometimes you put a quick band-aid in place, something to say, okay, this should prevent that risk from occurring.
Peter: But then there are going to be things you potentially can't get to that you might want to follow up on later. The example I was using was: I've got a service with a memory leak, and I know it's going to run out of memory once a week, so I put a job in place to restart it more often than weekly so that it always has enough memory. But at some point I'm going to need to go back and actually fix the memory leak.
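For illustration, here's a minimal sketch of the kind of stopgap Peter describes: a scheduled check that restarts the service before the leak exhausts its memory. The service name, the memory threshold and the systemd-based setup are assumptions made for the example, not anything prescribed in the episode; a plain weekly cron restart would do the same job.

```python
#!/usr/bin/env python3
"""Stopgap watchdog: restart a leaky service before it runs out of memory.

This is a band-aid, not a fix. The real follow-up is still to find and
fix the memory leak itself.
"""
import subprocess

SERVICE = "leaky-service"            # hypothetical systemd unit name
MEMORY_LIMIT_BYTES = 1_500_000_000   # restart before the leak exhausts memory


def current_memory_bytes(service: str) -> int:
    """Ask systemd how much memory the service is currently using."""
    out = subprocess.run(
        ["systemctl", "show", service, "--property=MemoryCurrent", "--value"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return int(out) if out.isdigit() else 0


def main() -> None:
    if current_memory_bytes(SERVICE) >= MEMORY_LIMIT_BYTES:
        # TEMPORARY FIX: reclaim the leaked memory by restarting,
        # rather than fixing the leak in the code.
        subprocess.run(["systemctl", "restart", SERVICE], check=True)


if __name__ == "__main__":
    main()
```

Run it from cron or a systemd timer more often than the service takes to fall over; it buys time, but it doesn't remove the need to go back and fix the leak.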
Dave: It's interesting because, as we were chatting before coming into this, we were talking about risk being something we're starting to see a lot. Unfortunately, risk is either a scary word we don't want to talk about too much, or a boring, mundane, analytical thing we look at from a risk-registry, tracking and mitigation perspective. So, if I reframe what you're talking about, one way of looking at it is that we have a finite amount of time. The obvious answer is that we should go into the code, find out where the memory leak is coming from, and fix it.
Dave: That's the black-and-white approach: fix the defects we find, fix the behaviors that aren't helping us. The reality is we have to continuously make trade-offs. Sometimes there's a quick fix, rebooting the service, that allows us to get by without investing the time and energy required to track down something unknown. And I think that's really what we're trying to look at: this trade-off of when do we go and solve a problem, call it solved and be done with it, and when do we do just enough to keep us on the straight and narrow and keep moving, and what are the consequences of that?
Peter: Exactly, and it's an imperfect example, I mean, there are better ones, just the one that came to mind that's easy to explain and that people may have run into. Quite often in that instance the question is who's going to look at doing this well, and the reality is that people aren't going to, because there's something else causing more pain that you'll get dragged away to and end up working on.
Dave: It's something we don't get a chance to think about, except when we've got a bit of time and a coffee in our hands and can talk about it hypothetically. The reality is, in many cases we ignore it, or we fix it and move on without documenting it properly, and then move on to something else. And then we have these behaviors, these snippets of code, these little scripts running that nobody knows why they're running, but we also know that if we start switching them off and cleaning them up, something horrible is going to happen. So they persist.
Peter: Yeah, there's this web of spaghetti where, in a large, complex system, what happens when I turn this service off? If you don't have a good map and understanding of how these things relate, or even if you do, it may have unintended consequences because there's a dependency you weren't aware of.
Dave: So what are we exploring? What's your recommendation in that world where you've got a cron job and you're rebooting this service?
Peter: Part of it is who has the time and who has to respond. Very often in that case, in my experience, operations will put the cron job in place because they're tired of getting woken up at two o'clock in the morning. Whoever is in charge of operations will be the one to put the cron job in to restart it, because they don't want to be woken up. So there's definitely some logic there: shifting accountability to where it belongs will help people do the right thing. But even then, one of the other things you need to look at is why we don't have time to do the right thing. If it's really going to fail in the next week, then yes, put the triage in, but shouldn't we also look at fixing it in the next seven days?
Dave: We all know examples of this. In fact, anybody listening to this could probably write down the three things they know they've got to go and fix. You're aware of the weaknesses in your systems and in how you're working, whatever it might be, and we know we're all continually making trade-offs. Now, one of the habits we want to generate in the transformations we make in organizations is recreating the space to decide whether a cron job and a bit of documentation is sufficient, it's doing its job, so let's leave it and keep an eye on it, or whether we now actually have to go in, properly understand what's causing the memory leak, and fix it.
Dave: Those sorts of decisions come from having space to evaluate the work coming at you. So the first thing is that teams have to be able to pull work in. If they're not pulling work in, they're going to get overloaded, and when teams are overloaded, crazy decisions get made. They're not crazy as individual decisions, but they add up to a fragile ecosystem that becomes unmaintainable at some point. The second thing is building the habits in. The things we talk about, non-functional requirements, definitions of done, all of these are practices and vocabulary to generate the conversation around how we treat these small things that don't feel very risky right now, but maybe they're really risky, maybe they're not.
Peter: And I like what you're saying there. It is about freeing up space for the teams to start having the right conversations about prioritizing these pieces. Understanding risk is actually a part of this: how risky is this? Is it really that critical that we fix this back-end service? Is it so impactful to the system that the cron job isn't sufficient? And we're probably dating ourselves by describing it as a cron job these days.
Peter: But the conceptual idea is there: there's an understanding and time to think through how we're going to respond to this, but also to create some kind of documentation around the fact that it was done and the reasons it was done that way. Because one of the other common problems you come across is that I've got all of these pieces lying around doing all of these different things, and nobody knows why they're doing them anymore, because the person who set them up has left and it's no longer clear. You spend a lot of cycles digging in, only to find that the service this thing is restarting doesn't even exist anymore, so why are we still running it? You run into a lot of this if there isn't that good hygiene as well.
Dave: But there's an ownership piece as well, because I think everybody has had the experience of a well-meaning leader or manager talking about things like defects in a system and saying, look, we're running out of time, let's track the defects so we know where they are, but not worry about fixing them; we've got to move to hit something. And there's this sort of point of no return where, all of a sudden, you move from a system that is robust and resilient, that you can work in and where you know how things are going, into a system that just starts degrading and becomes nearly impossible to work in. While we're talking, I'm going to introduce a story I heard last week.
Dave: I was chatting to some friends and they were talking about a utility company. Utilities everywhere spend a lot of money trimming the trees around their power lines, and this one saved itself a few hundred thousand dollars one year by simply not cutting the trees that year, because what possible damage could a single year's growth do? Of course, as they looked back over the following years, the damage was quite considerable. Yes, they cut the trees back the following year, but the growth had already weakened the infrastructure, bumping into and damaging power lines and various other parts of the infrastructure, so the overall cost was tremendous over the longer run. These are the sort of false decisions, a false sense of security: this change isn't really going to undermine where we're at, but it can have huge negative consequences.
Peter: And that comes from understanding the consequences: what is the risk here if I don't take this action, a normal action I take on a regular basis? What are the consequences going to be? Quite often in those circumstances, new leaders and new people coming in don't necessarily have the context or the understanding. Why are we trimming the trees back every year? They don't grow that fast, so why bother? Why don't we just do it every other year? They don't understand the consequences of that, and I imagine if you go back in time you'll probably find a point when somebody else tried this as well.
Dave: We're coming back to this: if we're thoughtful when we make changes to our system, when we put new things in, then we're also going to be thoughtful about what's required to maintain and operate them. That moves into operating costs. Now, if we trust that, we shouldn't be hacking away at those operating costs too much, unless there are economies of scale, new technologies, whatever it might be, because those operating costs were thoughtfully put together. They're not there to keep an operating team running, they're there to keep the system running, and therefore, if we trust that, they should be the minimum that we're doing, not an area we can go and slice.
Peter: Exactly, and this has always been the case: there's a need for that operating cost, especially in anything related to technology, and it often becomes the first target, well, why do we need all those operating costs? Which is why most operating budgets get cut by a certain amount every year. But there is a reality that you need a level of operations to manage, sustain and maintain the hygiene of the system, and not doing so could potentially be much, much more expensive, even in the short term. If you start, for example, removing systems that ensure you keep up to date with patching and other basic system hygiene, and you haven't got the right pieces in place, you can run into an awful lot of problems.
Dave: So, I'm just thinking we're getting very philosophical, is one way of putting it. How do we get to the point where there's a clean takeaway from this? You started really clearly with a great example. What's your recommendation around that?
Peter: I think there are a few crisp and clear recommendations that come out of this. One is that when you're making changes to systems, you need to ensure they're well documented, easily tagged and easily findable. You need good knowledge management around these types of changes, especially changes related to the mitigation of risk, and especially when those changes are not a complete solution but a band-aid. You need to record that we are applying this triage, this band-aid, at this point in time, and that we know other actions will potentially need to be taken at a future date. We need to make sure those things are captured, documented and referenceable, so that the people who come after you can find them, understand them and easily see why things were done the way they were done.
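To make that concrete, here's one minimal sketch of what capturing a band-aid as a tagged, findable record might look like. The field names, the JSON output and the example values are illustrative assumptions, not a prescription from the episode; a ticket, an ADR or a runbook entry serves the same purpose.

```python
from dataclasses import dataclass, field, asdict
from datetime import date
import json


@dataclass
class TemporaryFix:
    """A record of a band-aid applied now, with the follow-up it still owes."""
    summary: str                 # what was done
    reason: str                  # why the quick fix instead of the real one
    permanent_fix: str           # the work that still needs to happen
    owner: str                   # who to ask when nobody remembers
    applied_on: date
    review_by: date              # when to revisit if it is still running
    tags: list[str] = field(default_factory=list)


# Hypothetical example: the memory-leak restart job discussed earlier.
fix = TemporaryFix(
    summary="Cron job restarts leaky-service twice a week",
    reason="Service leaks memory and falls over roughly weekly; no time yet to trace the leak",
    permanent_fix="Profile the service, fix the leak, remove the restart job",
    owner="ops-team",
    applied_on=date(2024, 5, 1),
    review_by=date(2024, 8, 1),
    tags=["band-aid", "risk-mitigation", "leaky-service"],
)

# Store it somewhere searchable so the people who come after you can find it.
print(json.dumps(asdict(fix), default=str, indent=2))
```

Whatever the format, the useful part is that the record names an owner, a reason and the permanent fix still owed, so the next person doesn't have to reverse-engineer why the band-aid exists.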
Peter: Another part of it that you brought up and put very well, I think, is avoiding overloading teams. If teams are overloaded and context switching all the time, they will miss things, and because they don't have time to think it through, they'll say, well, we can just clean this up, nobody's using it, let's get rid of these feature flags or turn off these other pieces, and it will have unintended consequences. So we need to make sure teams aren't overloaded and have time to think through these problems, and the older and more complex your system is, the more time they're going to need to do that, which is yet another reason you should be modernizing your systems. These things feed into each other. It's very much a key part of it as well.
Peter: I think another piece, which we haven't necessarily covered directly yet but were talking about before we started recording, is the whole known unknowns and unknown unknowns part of it. You may think you know what all the possible risks are, and you'll have put mitigations in for everything you know about, but there are still going to be things that come at you out of left field and cause problems. You need to be prepared for those and have the right capabilities in place to deal with them, but you're not necessarily going to be able to directly mitigate and plan for them at this moment in time.
Dave: When you're describing that, Peter, I always think of this sort of shift in mindset: from preventing anything going wrong, and therefore doing everything you can to prevent it, to a mindset that says something is going to go wrong, so how do I mitigate the impact of the things going wrong that I may not know about right now? That mindset shift is almost generational in some ways, just because context and a lot of other things have changed and moved, but it is a really difficult change to make: from the view that I can control everything and we can prevent anything going wrong, to if something goes wrong, how are we managing the consequences to minimize the downside?
Peter: And I'd go even one step further and take a page out of Sidney Dekker's book, where he talks about Safety-I and Safety-II: to reduce the risk in the system, you emphasize what works, and you reinforce and strengthen those parts of the system, because you can't deal with the things you don't know are going to happen. All you can do is figure out what makes the system work, reinforce that, and make it go really, really well. And if one of the things you know makes the system work is having those operational processes in place, then that's part of making the system work.
Dave: For sure. I might just add one other thing we did touch on, which is the whole conversation, and the practices, around hygiene: the things that aren't always the new, innovative things, but are the basics we need to keep our work environments and ecosystems healthy, maintainable and a pleasure to work in, rather than incredibly frustrating and challenging.
Peter: Yes, and we want to remove the toil from the system. We want to make these things easy to do, and you want the system to be fun to work with, so you want to automate as much of that as possible so that it's not a set of manual, boring tasks that people have to do, because people are very bad at doing manual, boring tasks. Awesome. Well, thank you, as always, for the conversation, Dave. I really enjoyed it, and I look forward to the next one.
Dave: Until next time, Peter.
Peter: Thanks again. Until next time.
You've been listening to Definitely Maybe Agile, the podcast where your hosts Peter Maddison and David Sharrock focus on the art and science of digital, agile and DevOps at scale.