Definitely, Maybe Agile

AI in the Real World, Not the Demo

Peter Maddison and Dave Sharrock Season 3 Episode 211


Most conversations about AI focus on what it can do in a controlled setting. This one doesn't. Callum Sharrock spends his days deploying AI systems in real environments, watching them succeed and fail in ways no simulation predicted, and reporting what he finds. His conclusion? The trend line is steeper than most people realize, and snapshot thinking is getting a lot of organizations into trouble.

Peter Maddison and Dave Sharrock dig into why reliability, not capability, is the real adoption bottleneck right now. They talk through what happens when non-deterministic models get applied to problems that need deterministic answers, why validation and testing are becoming more important than writing the code itself, and how the calculus around decision making is changing fast. If you can build and test something in the time it takes to debate whether to do it, the meeting starts to look like the problem.

They also get into what this means for developers, for leaders, and for anyone trying to figure out where to actually invest their energy right now. The barriers to building have never been lower. That makes the question of what to build more important than ever.

This isn't a conversation about AI hype. It's about what's actually happening at the frontier, and what it means for the way organizations make decisions.

This Week's Takeaways:

  1. The barriers to building have never been lower - figuring out what's worth building is now the real work
  2. Leadership is shifting toward agency and rapid decision-making, away from top-down strategy setting
  3. If you can run the experiment in the time it takes to schedule the meeting about it, run the experiment

If this episode resonated, follow Definitely Maybe Agile wherever you listen to podcasts so you never miss a conversation. And if you know someone spending two hours debating whether to test an idea they could just build, send this one their way. There are plenty more episodes worth your time at definitelymaybeagile.com.

Welcome And Guest Introduction

Peter [0:04]: Welcome to Definitely Maybe Agile, the podcast where Peter Maddison and Dave Sharrock discuss the complexities of adopting new ways of working at scale. Hello, and I'm here with Dave and Callum today. We're going to have another interesting conversation, I hope. So I'm going to start by introducing our special guest, Callum Sharrock, who's going to talk to us about his journey and some really interesting work he's been doing around AI and robotics. Callum, why don't you tell us a little bit about yourself?

Callum [0:35]: Sure. My name's Callum, and I studied computer science at the University of Toronto. I've been working in the robotics and AI space for a while now. Previously I was at Tesla, designing the cleaning robot for the Robotaxi program. More recently I've shifted focus toward AI safety evaluations - specifically how AI behaves in real-world deployments. I'm currently doing that at Andon Labs.

Peter [1:01]: That's a pretty big shift. What pulled you from Tesla toward AI safety?

Callum [1:11]: After graduating, I took a few months to actually think about what I wanted to do. Which, honestly, more people should do. A lot of people just kind of end up doing whatever they think they're supposed to do without really stopping to ask why.

I did a bunch of research and kept coming back to AI safety as an area that felt genuinely under-discussed. That surprised me, given how much emphasis there is on capabilities. The more I dug into it, the more I thought: not enough people are looking at this seriously, and it's also kind of fascinating work. You get to take frontier systems that have never existed before, and try to figure out every way they could go wrong - so that when they do get deployed, there's at least some confidence that they're being deployed safely.

Dave [2:19]: Can you expand on the robotics side a bit? You've got quite a history there too. How did you bring that together with the safety and evals work?

Callum [2:32]: Sure. In high school I was in a program that let me skip half my classes to fiddle with electronics and robotics, which was great. I made everything from video games to basic electronic circuits. I loved robots. I loved building things that existed in the physical world, not just on a screen.

Then I accidentally ended up in computer science at university instead of engineering - didn't really do my research on that one. But timing worked out, because right as I'm sitting in theory classes wishing I was working with robots, reinforcement learning and AI methods were starting to get applied to robotics. I gave myself a rule: only look at internships that involved robotics.

My first internship was at a company in Berlin doing automated quality inspection - deep neural networks for welding defect identification. Then painting robots. If you drew something online, we could use a robot to repaint it in actual acrylic on canvas. After that, Tesla. Then research at the Vector Institute in my final year on reinforcement learning for compute-constrained robots - basically, how do you make low-resource robots work faster and more effectively?

Peter [4:19]: Which one was your favourite?

Real-World AI Evaluations Explained

Callum [4:21]: Honestly, what I'm doing now. And it's because of something I kept noticing with robotics - what I actually enjoy is work that's deployed and interacting with people and things in the real world. It's easy to get very theoretical, but at some point, it doesn't really matter if it doesn't happen out there.

There's also a lot that happens in the real world that you simply can't simulate. What we do now is figure out different ways to deploy AI systems, evaluate how they actually perform in real conditions, as safely as we can, and then report those findings publicly. The goal is transparency - both around capability timelines and around the safety risks that come with them.

Peter [5:05]: So for someone who's outside that ecosystem - someone who just opens ChatGPT and types things in - what do you wish they understood better?

Callum [5:19]: There's this idea - it's usually credited as Amara's Law - that we overestimate what a technology can do in a year and massively underestimate what it can do in ten. I think a lot of people use whatever AI assistant they have, hit one case where it gets something wrong that they could easily do themselves, and write it off.

But if you look at the trend line - three years ago, these models were pretty rough in a lot of ways. Today they're genuinely good at a wide range of things. New models are solving IMO problems, discovering new mathematics. The trajectory is steep, and it's not linear. People tend to take a snapshot of where things are today and assume that's roughly where they'll stay. That's not what the data suggests.

Dave [6:25]: That's a classic change management problem, actually. When you threaten to change someone's work environment with new technology, the immediate response is to go find every place it fails. "Look, it got this wrong. Look, it can't do that." Have you seen examples where people successfully flip that around and focus on where it genuinely excels?

Callum [7:00]: That's partly why I'm excited about the work we're doing now. We've been running this vending machine experiment for a while - we give an AI system control of a vending machine, it decides what to stock, how to price things, and interacts with people over Slack. As new model versions come out, you can track how their performance changes over time. The trend is steep.

There's also a metric that gets shared a lot in the AI community - a graph tracking the time horizon over which an AI can complete a coding task autonomously. It's gone from a few minutes to a few hours pretty quickly. And these improvements aren't linear, they're exponential. Which, as we saw with COVID projections, is genuinely hard for people to process intuitively.

Dave [8:50]: With robotics, do you see the same trajectory, just a bit slower? Or is it roughly mirroring the LLM development curve?

Callum [9:00]: The common phrase in robotics is that it hasn't had its ChatGPT moment. That's probably fair. My somewhat unpopular take is that robotics is now a software problem, not a hardware problem. My evidence for that: if you teleoperate a robot, you can make it do basically anything. Which tells me the hardware is largely capable enough. The bottleneck is the algorithm.

And as AI improves, those algorithm problems are going to get significantly easier to solve. Robotics has a feedback loop challenge - a biology experiment might take months to return data, software is nearly instant, and robotics sits somewhere in between. But improvements in software are going to speed up all of that too.

Peter [9:59]: There's a lot of learning happening fast. And if you look at something like Chinese humanoid robots - comparing footage from one year to the next, the progress is remarkable. I mean, literally leaps and bounds. Backflips.

Dave [10:37]: It is unnerving, some of those before-and-after comparisons. The change is remarkable.

Peter [10:47]: So at what point do you think AI starts genuinely replacing developers? That's where a lot of the investment has gone in terms of testing AI capabilities. How fast do you think that shift actually happens?

Callum [11:10]: It's a hard one to answer. And I'd push back slightly on the framing - I don't think coding is necessarily where the most investment has gone. I think it just gets a lot of attention because programming is a highly verifiable domain. Logic A implies B implies C. The feedback loops are clean. That makes it relatively easy to train on and measure. The same is true for accounting, tax work, mathematics.

Big companies are citing huge numbers now - Anthropic saying 90% plus of its code is AI-written, Google somewhere above 50%. Those numbers will keep climbing. But what that means for developers is genuinely unclear. Does it make developers more valuable because they can do far more? Or does it reduce demand? I've heard compelling arguments both ways. My personal answer is: I write much less code than I used to. I spend more time validating code and providing feedback on what gets generated.

Peter [14:45]: The language starts to matter a lot less when you have a system that can translate into whatever you need. It does vary when you're working against an existing codebase in an unusual language where the model has less training data - the answers get less reliable. And you also run into the person running the mainframe who simply doesn't feel the demand yet. They don't see the pressure from where they sit in the organization.

Dave [13:01]: It's the classic consulting answer - it depends. There are codebases you could probably just point tools at and rebuild. But there are also problems where human cognitive experience of the system still matters a lot. And we've talked about this, Peter - retooling an industry isn't just SDLC going faster. It's asking whether you even need that process the way it currently exists.

Validation, Testing, And Better Requirements

Callum [15:30]: And there are always those edge cases. But a good example of the core shift is Excel reportedly being rewritten in Rust and Python relatively quickly from the original codebase. Moving codebases, as problems go, isn't actually that hard - you generate tests to understand how the current code behaves, generate a new codebase, and if the same tests pass, you're largely good. So the really important thing becomes validation and testing. In my own development work now, that's basically all I do - how do I test that this works for all the use cases I actually care about?
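The migration recipe Callum describes - pin down the current behavior with tests, generate the new code, and re-run the same tests against it - can be sketched in a few lines. This is an illustrative sketch only; the function names and pricing logic are invented for the example, not taken from the episode:

```python
# Sketch of test-driven codebase migration (hypothetical example).
# Step 1: capture the legacy behavior as characterization tests.
# Step 2: run the same tests against the rewritten implementation.

def legacy_price_with_tax(price_cents, tax_rate):
    """Original implementation whose behavior we want to preserve."""
    return int(price_cents * (1 + tax_rate) + 0.5)  # round half up

def rewritten_price_with_tax(price_cents, tax_rate):
    """New implementation (e.g. generated during a rewrite)."""
    total = price_cents * (1 + tax_rate)
    return int(total + 0.5)

def characterization_cases():
    # Inputs recorded from the real system; expected outputs come
    # from running the *legacy* code, not from a written spec.
    inputs = [(1000, 0.13), (999, 0.05), (0, 0.2), (250, 0.0)]
    return [(args, legacy_price_with_tax(*args)) for args in inputs]

def migration_passes():
    """True if the rewrite matches legacy behavior on every case."""
    return all(
        rewritten_price_with_tax(*args) == expected
        for args, expected in characterization_cases()
    )

print(migration_passes())  # True when the behaviors match
```

The point of the pattern is that the expected values are captured from the old system rather than specified up front, which is what makes "same tests pass" a meaningful bar for the rewrite.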

Peter [16:23]: I'd agree. In the systems I'm building, I'm not really looking at the code. More and more people I talk to say the same thing - what matters is whether the outcome you're getting matches the outcome you intended.

Dave [16:46]: Which gets quite interesting, because now you can explore the behavior of a system under stress and dig into edge cases much more easily than before. Whereas previously, that was some poor person manually validating everything and trying to squeeze in smoke testing whenever they could find the time.

Peter [17:04]: And here's something kind of funny - developers have spent decades asking for decent requirements and rarely getting them. Now I'm watching developers actually build out the context themselves. The tooling is almost demanding that they create clearer architectural definitions and requirements, because that's what the agents need to build the right thing. It's a strange inversion.

Dave [17:57]: I've got a metaphor, Callum. You're describing work where you're in the cockpit figuring out how to navigate the track - not worried about how the engine itself works. Whereas in the past, there was a lot more building of the car and less focus on the road ahead.

Callum [18:22]: That makes sense. It's like moving from C to Python - a higher abstraction layer. In Python, you don't need to worry much about how the computer manages memory for most use cases. The tricky question is knowing when you do need to go lower. A Formula One driver still needs some understanding of how the car works to give meaningful feedback. I think that's still true here - having some depth in the layers below helps you understand why problems surface the way they do. How long that remains necessary? Genuinely hard to say.

Reliability As The Adoption Bottleneck

Peter [19:30]: And at some point, if we stop caring about the language, maybe we just let it write in assembler. If we don't need to read it, why not write it in something faster? Anyway - what are the big things you see actually holding AI and robotics back right now?

Callum [19:54]: Reliability is probably the biggest one. And I struggle with this because I tend to assume people are operating correctly - I like to think when I go to the doctor, they know exactly what they're doing. But the Waymo and Robotaxi story is a good illustration. We had cars doing 100 kilometers autonomously in the DARPA challenges in the mid-2000s. The basic technology was largely there. It's taken another 15 to 20 years to get to where we are now - semi-deployed, still limited. Most of that time has been spent improving reliability to a level where society is comfortable enough with it.

You can say something similar about AI in software. For a weekend project, spinning up a few tools will get you something you're happy with. But if my bank is now letting an AI make decisions about my savings account, I want considerably more validation first. So reliability - and finding robust ways to verify it - is the core challenge. Robotics faces the same thing, just a bit further out. Models can probably run an espresso machine for a day now with some light intervention. But transfer that to laundry folding and it's not the same model picking up a new task cleanly. The training is still somewhat task-specific. And within a task, the reliability isn't yet at the point where you close your eyes and forget about it. That's what people are tracking timelines on.

Dave [22:47]: There's a secondary piece on the autonomous vehicle thing that strikes me. My understanding is these cars are already statistically safer than human drivers. And yet society isn't ready to allow them on the road - not because they're not safe, but because we haven't figured out what that means legally, for insurance, for accountability. It's less of a technology problem and much more of a legal, philosophical, and social challenge.

When Deterministic Beats AI

Callum [23:55]: True. And there's an interesting angle there - it's not obvious we have to design these systems to fit the infrastructure we currently have. There's a version of this where you have a dedicated highway for autonomous vehicles only. You can reduce the unknowns significantly. And in general, the pattern we try to follow is: deploy in safe environments where failure is contained and informative, report back what we learn, and help inform public understanding of both the capabilities and the risks.

Peter [24:55]: Do you see cases where trust gets damaged because people are applying non-deterministic models to deterministic problems? For example - and I hope no one is actually doing this - having an AI model decide whether to apply the brakes on a car. If I tell a system to copy a file from point A to point B, I don't want an LLM in the middle of that. I want bytes moving reliably. The LLM is a cost with no upside there. I see this a lot - people sprinkling AI on everything, including places where a simple, predictable script would do the job better and cheaper.

Callum [26:32]: Really interesting question. There are two parts to it. One is just choosing the wrong abstraction level - I saw a project where someone used an LLM to figure out what the current date was. The date comes in five formats. Parsing it is a solved problem. That's just not the right tool.
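The date example is exactly the kind of job a few lines of deterministic code handle outright, with no model in the loop. A minimal sketch - the specific format list here is an illustrative assumption, not something specified in the episode:

```python
from datetime import datetime

# Deterministic date normalization - no LLM required.
# The accepted formats below are illustrative assumptions.
KNOWN_FORMATS = [
    "%Y-%m-%d",   # 2024-03-01
    "%d/%m/%Y",   # 01/03/2024
    "%m-%d-%Y",   # 03-01-2024
    "%b %d, %Y",  # Mar 01, 2024
    "%d %B %Y",   # 1 March 2024
]

def normalize_date(text):
    """Return an ISO date string, or raise if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")

print(normalize_date("Mar 01, 2024"))  # 2024-03-01
```

A fixed list of formats is cheaper, faster, and perfectly predictable - and when an input genuinely doesn't match, it fails loudly instead of guessing.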

But the broader point is worth thinking about carefully. I'd argue the right structure is: deterministic controls, non-deterministic orchestration. That's roughly how humans work too. Doctors are non-deterministic - and they make mistakes. But the evidence increasingly suggests AI systems may soon make fewer mistakes than human drivers, or human diagnosticians in certain areas. The stats for autonomous vehicles are already orders of magnitude better than human drivers on many metrics. The rest is edge cases worth solving, plus a legitimate social and political adjustment that takes time. As for whether to trust AI in higher-stakes contexts - I think that comes back to evaluation. There may be a near future where a model can build something like AlphaGo. At that point the abstraction keeps moving up, and these systems get better at determining where deterministic or non-deterministic approaches make sense.

Skills That Matter In AI Teams

Dave [28:40]: What strikes me as you describe all this, Callum - and I'm looking at Peter here too - is how much of the traditional framework for how business runs and how technology gets built is genuinely up for grabs. Not as a hypothetical anymore. If a system can go and build an AlphaGo equivalent on request, a lot of the stepping stones we've spent years navigating look quite different. Let me turn that into a question: what are the skills that actually have value in the world you're working in?

Callum [29:58]: The classic answer - and I do think it's right - is agency. Things are getting easier to do technically, and the pace is accelerating. So the real question shifts to: can you figure out which thing is worth doing? Strong prioritization. Blinders on anything that isn't the actual problem. Get the data as fast as you possibly can. Ask the most important questions first. De-risk things quickly, then move.

Peter [30:54]: You realize that's what we've been working on for decades, Dave.

Dave [31:02]: Right. We can just keep pointing it out.

Peter [31:15]: The other side of agency - especially in large organizations - is that a lot of people have had it trained out of them. Years of needing sign-offs and waiting on other parts of the organization to get you information. These tools change that equation. Someone with genuine agency can now gather information, do analysis, build things, and test ideas without depending on a chain of approvals to move.

Three Takeaways And Closing

Callum [32:12]: That's a big one. The calculus has changed on what's worth debating versus just doing. Something that used to take two weeks now takes a fraction of that time. A two-hour alignment meeting to decide whether to run an experiment starts to look pretty silly when you can just run the experiment in the same amount of time.

Peter [32:48]: I might regret saying this, but imagine what you could have built while listening to this podcast.

On that note - we always like to wrap up with one takeaway each. Callum, what would you want the audience to walk away with?

Callum [33:16]: The barriers to building things have never been lower. So figuring out what to build - and then actually going and building it - has never been more important. That question of what to build is where the real work is now.

Dave [33:34]: I'm going to pull on the thread we got to near the end. Callum, it's been genuinely fascinating hearing what you're up to and what you're seeing around you. What strikes me from this conversation is how the role of leadership shifts. It moves toward agency, rapid decision making, and - this is interesting - less toward being in control and setting the strategy. Because if you can prove something in the marketplace by building it quickly, the whole model of top-down strategy changes. There's a real gap there worth exploring.

Peter [34:22]: That's a good point. It's the inversion of the pyramid we've been talking about for years. But AI is something that actually makes it possible now in ways it wasn't before. Thanks very much, Callum. Really enjoyed the conversation. And Dave, as always. Until next time.

Dave [35:39]: Callum, thank you so much. Peter, always a pleasure.

Peter [35:43]: You've been listening to Definitely Maybe Agile, the podcast where your hosts Peter Maddison and Dave Sharrock focus on the art and science of digital, agile, and DevOps at scale.