
Definitely, Maybe Agile
The Hidden Cost of Temporary Fixes
Every technical system harbors its share of quick fixes and band-aids – those temporary solutions we implement with the best intentions of returning to fix properly "someday." But what happens when that day never comes?
Peter Maddison and David Sharrock dive deep into what they call "longstanding risks" – the accumulated technical debt that results from prioritizing expediency over completeness. Through a relatable example of a memory-leaking service that gets automatically restarted rather than properly fixed, they unpack the hidden costs of these decisions. The conversation reveals how seemingly minor shortcuts can gradually transform robust systems into fragile, unmaintainable messes.
The hosts share a compelling analogy about a utility company that saved money by skipping tree trimming around power lines for just one year – only to face significantly higher costs from the resulting infrastructure damage. This perfectly illustrates how short-term thinking about technical maintenance creates expensive long-term consequences. They offer practical recommendations including proper documentation of temporary fixes, avoiding team overload, and maintaining good system hygiene.
What makes this episode particularly valuable is the mindset shift it advocates: moving from attempting to prevent all possible failures to building systems that remain resilient when inevitable problems occur. As Maddison references from safety expert Sidney Dekker's work, sometimes the best approach is focusing on what makes your system work well rather than obsessively trying to eliminate every risk. Whether you're managing complex technical systems or leading transformation efforts, these insights will help you balance pragmatic solutions with long-term system health.
Peter: Welcome to Definitely Maybe Agile, the podcast where Peter Maddison and David Sharrock discuss the complexities of adopting new ways of working at scale. Hello, Dave, how are you today?
Dave: Peter, great to catch up with you. What are we chatting about today?
Peter: We're going to talk about longstanding risks, at least that's what I'm going to call it for now; we may decide to call it something else as we work through this. But it came from a conversation I've been having recently: when you identify a risk and do work in the system to mitigate it, sometimes you put a quick band-aid in place, something to say, okay, this should prevent that risk from occurring.
Peter: But then there are going to be things you potentially can't get to that you might want to follow up on later. The example I was using was: I've got a service with a memory leak, and I know it's going to run out of memory once a week, so I put a job in place to restart it more often than weekly so that it always has enough memory. But at some point I'm going to need to go back and actually fix the memory leak.
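For illustration, here's a minimal sketch of the kind of stopgap Peter describes: a scheduled check that restarts the service before the leak exhausts its memory. The service name, the memory threshold and the systemd-based setup are assumptions made for the example, not anything prescribed in the episode; a plain weekly cron restart would do the same job.

```python
#!/usr/bin/env python3
"""Stopgap watchdog: restart a leaky service before it runs out of memory.

This is a band-aid, not a fix. The real follow-up is still to find and
fix the memory leak itself.
"""
import subprocess

SERVICE = "leaky-service"            # hypothetical systemd unit name
MEMORY_LIMIT_BYTES = 1_500_000_000   # restart before the leak exhausts memory


def current_memory_bytes(service: str) -> int:
    """Ask systemd how much memory the service is currently using."""
    out = subprocess.run(
        ["systemctl", "show", service, "--property=MemoryCurrent", "--value"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return int(out) if out.isdigit() else 0


def main() -> None:
    if current_memory_bytes(SERVICE) >= MEMORY_LIMIT_BYTES:
        # TEMPORARY FIX: reclaim the leaked memory by restarting,
        # rather than fixing the leak in the code.
        subprocess.run(["systemctl", "restart", SERVICE], check=True)


if __name__ == "__main__":
    main()
```

Run it from cron or a systemd timer more often than the service takes to fall over; it buys time, but it doesn't remove the need to go back and fix the leak.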
Dave: It's interesting because, as we were chatting before coming into this, we were talking about risk being something we're starting to see a lot. Unfortunately, risk is either a scary word we don't want to talk about too much, or a boring, mundane, analytical thing we look at from a risk-registry, tracking and mitigation perspective. So, if I reframe what you're talking about, one way of looking at it is that we have a finite amount of time. The obvious answer is that we should go into the code, find out where the memory leak is coming from, and fix it.
Dave: That's the black-and-white approach: fix the defects we find, fix the behaviors that aren't helping us. The reality is we have to continuously make trade-offs. Sometimes there's a quick fix, rebooting the service, that allows us to get by without investing the time and energy required to track down something unknown. And I think that's really what we're trying to look at: this trade-off of when do we go and solve a problem, call it solved and be done with it, and when do we do just enough to keep us on the straight and narrow and keep moving, and what are the consequences of that?
Peter: Exactly, and it's an imperfect example, I mean, there are better ones, just the one that came to mind that's easy to explain and that people may have run into. Quite often in that instance the question is who's going to look at doing this well, and the reality is that people aren't going to, because there's something else causing more pain that you'll get dragged away to and end up working on.
Dave: It's something we don't get a chance to think about, except when we've got a bit of time and a coffee in our hands and can talk about it hypothetically. The reality is, in many cases we ignore it, or we fix it and move on without documenting it properly, and then move on to something else. And then we have these behaviors, these snippets of code, these little scripts running that nobody knows why they're running, but we also know that if we start switching them off and cleaning them up, something horrible is going to happen. So they persist.
Peter: Yeah, there's this web of spaghetti where, in a large, complex system, what happens when I turn this service off? If you don't have a good map and understanding of how these things relate, or even if you do, it may have unintended consequences because there's a dependency you weren't aware of.
Dave: So what are we exploring? What's your recommendation in that world where you've got a cron job and you're rebooting this service?
Peter: Part of it is who has the time and who has to respond. Very often in that case, in my experience, operations will put the cron job in place because they're tired of getting woken up at two o'clock in the morning. Whoever is in charge of operations will be the one to put the cron job in to restart it, because they don't want to be woken up. So there's definitely some logic there: shifting accountability to where it belongs will help people do the right thing. But even then, one of the other things you need to look at is why we don't have time to do the right thing. If it's really going to fail in the next week, then yes, put the triage in, but shouldn't we also look at fixing it in the next seven days?
Dave: We all know examples of this. In fact, anybody listening to this could probably write down the three things they know they've got to go and fix. You're aware of the weaknesses in your systems and in how you're working, whatever it might be, and we know we're all continually making trade-offs. Now, one of the habits we want to generate in the transformations we make in organizations is recreating the space to decide whether a cron job and a bit of documentation is sufficient, it's doing its job, so let's leave it and keep an eye on it, or whether we now actually have to go in, properly understand what's causing the memory leak, and fix it.
Dave: Those sorts of decisions come from having space to evaluate the work coming at you. So the first thing is that teams have to be able to pull work in. If they're not pulling work in, they're going to get overloaded, and when teams are overloaded, crazy decisions get made. They're not crazy as individual decisions, but they add up to a fragile ecosystem that becomes unmaintainable at some point. The second thing is building the habits in. The things we talk about, non-functional requirements, definitions of done, all of these are practices and vocabulary to generate the conversation around how we treat these small things that don't feel very risky right now, but maybe they're really risky, maybe they're not.
Peter: And I like what you're saying there. It is about freeing up space for the teams to start having the right conversations about prioritizing these pieces. Understanding risk is actually a part of this: how risky is this? Is it really that critical that we fix this back-end service? Is it so impactful to the system that the cron job isn't sufficient? And we're probably dating ourselves by describing it as a cron job these days.
Peter: But the conceptual idea is there: there's an understanding and time to think through how we're going to respond to this, but also to create some kind of documentation around the fact that it was done and the reasons it was done that way. Because one of the other common problems you come across is that I've got all of these pieces lying around doing all of these different things, and nobody knows why they're doing them anymore, because the person who set them up has left and it's no longer clear. You spend a lot of cycles digging in, only to find that the service this thing is restarting doesn't even exist anymore, so why are we still running it? You run into a lot of this if there isn't that good hygiene as well.
Dave: But there's an ownership piece as well, because I think everybody has had the experience of a well-meaning leader or manager talking about things like defects in a system and saying, look, we're running out of time, let's track the defects so we know where they are, but not worry about fixing them; we've got to move to hit something. And there's this sort of point of no return where, all of a sudden, you move from a system that is robust and resilient, that you can work in and where you know how things are going, into a system that just starts degrading and becomes nearly impossible to work in. While we're talking, I'm going to introduce a story I heard last week.
Dave: I was chatting to some friends and they were talking about a utility company. Utilities everywhere spend a lot of money trimming the trees around their power lines, and this one saved itself a few hundred thousand dollars one year by simply not cutting the trees that year, because what possible damage could a single year's growth do? Of course, as they looked back over the following years, the damage was quite considerable. Yes, they cut the trees back the following year, but the growth had already weakened the infrastructure, bumping into and damaging power lines and various other parts of the infrastructure, so the overall cost was tremendous over the longer run. These are the sort of false decisions, a false sense of security: this change isn't really going to undermine where we're at, but it can have huge negative consequences.
Peter: And that comes from understanding the consequences: what is the risk here if I don't take this action, a normal action I take on a regular basis? What are the consequences going to be? Quite often in those circumstances, new leaders and new people coming in don't necessarily have the context or the understanding. Why are we trimming the trees back every year? They don't grow that fast, so why bother? Why don't we just do it every other year? They don't understand the consequences of that, and I imagine if you go back in time you'll probably find a point when somebody else tried this as well.
Dave: We're coming back to this: if we're thoughtful when we make changes to our system, when we put new things in, then we're also going to be thoughtful about what's required to maintain and operate them. That moves into operating costs. Now, if we trust that, we shouldn't be hacking away at those operating costs too much, unless there are economies of scale, new technologies, whatever it might be, because those operating costs were thoughtfully put together. They're not there to keep an operating team running, they're there to keep the system running, and therefore, if we trust that, they should be the minimum that we're doing, not an area we can go and slice.
Peter: Exactly, and this has always been the case: there's a need for that operating cost, especially in anything related to technology, and it often becomes the first target, well, why do we need all those operating costs? Which is why most operating budgets get cut by a certain amount every year. But there is a reality that you need a level of operations to manage, sustain and maintain the hygiene of the system, and not doing so could potentially be much, much more expensive, even in the short term. If you start, for example, removing systems that ensure you keep up to date with patching and other basic system hygiene, and you haven't got the right pieces in place, you can run into an awful lot of problems.
Dave: So, I'm just thinking we're getting very philosophical, is one way of putting it. How do we get to the point where there's a clean takeaway from this? You started really clearly with a great example. What's your recommendation around that?
Peter: I think there are a few crisp and clear recommendations that come out of this. One is that when you're making changes to systems, you need to ensure they're well documented, easily tagged and easily findable. You need good knowledge management around these types of changes, especially changes related to the mitigation of risk, and especially when those changes are not a complete solution but a band-aid. You need to record that we are applying this triage, this band-aid, at this point in time, and that we know other actions will potentially need to be taken at a future date. We need to make sure those things are captured, documented and referenceable, so that the people who come after you can find them, understand them and easily see why things were done the way they were done.
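To make that concrete, here's one minimal sketch of what capturing a band-aid as a tagged, findable record might look like. The field names, the JSON output and the example values are illustrative assumptions, not a prescription from the episode; a ticket, an ADR or a runbook entry serves the same purpose.

```python
from dataclasses import dataclass, field, asdict
from datetime import date
import json


@dataclass
class TemporaryFix:
    """A record of a band-aid applied now, with the follow-up it still owes."""
    summary: str                 # what was done
    reason: str                  # why the quick fix instead of the real one
    permanent_fix: str           # the work that still needs to happen
    owner: str                   # who to ask when nobody remembers
    applied_on: date
    review_by: date              # when to revisit if it is still running
    tags: list[str] = field(default_factory=list)


# Hypothetical example: the memory-leak restart job discussed earlier.
fix = TemporaryFix(
    summary="Cron job restarts leaky-service twice a week",
    reason="Service leaks memory and falls over roughly weekly; no time yet to trace the leak",
    permanent_fix="Profile the service, fix the leak, remove the restart job",
    owner="ops-team",
    applied_on=date(2024, 5, 1),
    review_by=date(2024, 8, 1),
    tags=["band-aid", "risk-mitigation", "leaky-service"],
)

# Store it somewhere searchable so the people who come after you can find it.
print(json.dumps(asdict(fix), default=str, indent=2))
```

Whatever the format, the useful part is that the record names an owner, a reason and the permanent fix still owed, so the next person doesn't have to reverse-engineer why the band-aid exists.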
Peter: Another part of it that you brought up and put very well, I think, is avoiding overloading teams. If teams are overloaded and context switching all the time, they will miss things, and because they don't have time to think it through, they'll say, well, we can just clean this up, nobody's using it, let's get rid of these feature flags or turn off these other pieces, and it will have unintended consequences. So we need to make sure teams aren't overloaded and have time to think through these problems, and the older and more complex your system is, the more time they're going to need to do that, which is yet another reason you should be modernizing your systems. These things feed into each other. It's very much a key part of it as well.
Peter: I think another piece, which we haven't necessarily covered directly yet but were talking about before we started recording, is the whole known unknowns and unknown unknowns part of it. You may think you know what all the possible risks are, and you'll have put mitigations in for everything you know about, but there are still going to be things that come at you out of left field and cause problems. You need to be prepared for those and have the right capabilities in place to deal with them, but you're not necessarily going to be able to directly mitigate and plan for them at this moment in time.
Dave: When you're describing that, Peter, I always think of this sort of shift in mindset: from preventing anything going wrong, and therefore doing everything you can to prevent it, to a mindset that says something is going to go wrong, so how do I mitigate the impact of the things going wrong that I may not know about right now? That mindset shift is almost generational in some ways, just because context and a lot of other things have changed and moved, but it is a really difficult change to make: from the view that I can control everything and we can prevent anything going wrong, to if something goes wrong, how are we managing the consequences to minimize the downside?
Peter: And I'd go even one step further and take a page out of Sidney Dekker's book, where he talks about Safety-I and Safety-II: to reduce the risk in the system, you emphasize what works, and you reinforce and strengthen those parts of the system, because you can't deal with the things you don't know are going to happen. All you can do is figure out what makes the system work, reinforce that, and make it go really, really well. And if one of the things you know makes the system work is having those operational processes in place, then that's part of making the system work.
Dave: For sure. I might just add one other thing we did touch on, which is the whole conversation, and the practices, around hygiene: the things that aren't always the new, innovative things, but are the basics we need to keep our work environments and ecosystems healthy, maintainable and a pleasure to work in, rather than incredibly frustrating and challenging.
Peter: Yes, and we want to remove the toil from the system. We want to make these things easy to do, and you want the system to be fun to work with, so you want to automate as much of that as possible so that it's not a set of manual, boring tasks that people have to do, because people are very bad at doing manual, boring tasks. Awesome. Well, thank you, as always, for the conversation, Dave. I really enjoyed it, and I look forward to the next one.
Dave: Until next time, Peter.
Peter: Thanks again. Until next time.
You've been listening to Definitely Maybe Agile, the podcast where your hosts Peter Maddison and David Sharrock focus on the art and science of digital, agile and DevOps at scale.