Definitely, Maybe Agile
Experimentation
This week we dive into experimentation: what it is, its purpose, and how we often mess it up. How do you design experiments? What makes an experiment good? How might you measure success?
All this and more!
We love to hear feedback! If you have questions, would like to propose a topic, or even join us for a conversation, contact us here: feedback@definitelymaybeagile.com
Peter (0:07): Welcome to Definitely Maybe Agile, the podcast where Peter Madison and David Sharrock discuss the complexities of adopting new ways of working at scale. Hello, welcome! Dave and I are here again for another exciting talk. So what's the topic today, Dave?
Dave (0:23): Experiments, I believe—or experimentation, maybe. Let's see how that goes... which, effectively, is what this podcast has been until now.
Peter (0:30): Yeah, well, I was kind of experimenting with the introduction there to see if I could mix it up a little. So where do we want to start on this? It's a broad topic. I could talk about this for quite a while. We're going to try and cram it into like 15 to 20 minutes.
Dave (0:45): I find it so peculiar, because the industry has a tendency to leap on a phrase or a term, and "experiments" is one of those. I find so many conversations where throwing everything to one side and just trying any idea is treated as "let's run an experiment," and then whatever the idea is becomes, supposedly, the experiment. And it drives me wild, because that isn't how an experiment is formed. There's a lot more to it than that. What about your experience with experiments, or with the phrase "let's run an experiment"?
Peter (1:26): My experience with it is that people say that they want to do this, but they are very, very bad at actually doing so. And what they consider to be an experiment is something that's set up that can never fail. So it's not really an experiment—it's a "hey, we're going to go here, let's experiment with a path forward," but whatever you do, don't not succeed.
Dave (1:49): Well, it's to try something new, right? "I'm going to try something new. Therefore, that must be the experiment." And the intention... Going back to minimum viable products, the whole concept there is let's validate whether the hypothesis we have is a good one or a bad one. So let's try testing that hypothesis by running an experiment. Designing an experiment has been lost or ignored in the rush to just try something new as a minimum viable product or however we want to look at it.
Peter (2:30): I mean, I think there's an element of—maybe it's just not that easy to do, either. We throw the word around: "experiment, just go experiment with that." But there are a lot of things holding people back. We've set people up through education, through the way they've been asked to work, through what they've been told to do, so that experimentation isn't necessarily very natural in how we behave or in how we approach problems. So where does it all come from anyway? Why do we care about experimentation? And I'll let you have a kick at that one.
Dave (3:10): Well, I mean, I have a background in academia and a PhD in geophysics and seismology, so I'm all about running experiments. And the reason I run experiments, or I'm interested in experiments, is to learn stuff. That's how we learn. It's that whole empirical feedback loop. We learn by trying something, failing at it, adjusting our approach, trying it again. Failing at it, adjusting our approach, trying it again, maybe succeeding. So, in a nutshell, experimentation allows us to learn.
If we look at the Renaissance, where the scientific method really took hold as a way of seeing the world around us, it led to some of the most dramatic learning curves, if you like—whether in the sciences, the arts, the humanities, wherever it might be. So to my mind, experimentation is always associated with learning, and it's a powerful tool for that very reason.
Peter (4:17): Yeah, exactly. And it's one of these problems that relates back... So when we're talking about it in terms of organizational change—which is generally what we like to talk about on this podcast—the reason experimentation comes up so much is that we need to be able to learn what's going to work for us, what's going to work for this organization, what's going to work in the circumstances and the context and in the universe that we've created for ourselves within the organization. How are we going to find out what the right way to work together is?
And there's this belief that everybody's out there looking for a silver bullet. We're going to pick the framework off the shelf, we're going to slot it in, and we're going to do everything verbatim and now we're done. We don't need to do anything else.
Peter (5:06): It's like we've learned everything we could possibly learn and we've arrived. And there are some psychological pieces around where that comes from and what causes some of it. But it's really that piece where you need the experimentation, so that once you've brought this in you can say, "Okay, but what might we try next? How else might we get better? Are these things actually still serving us? Are there other ways in which we might approach this, and what experiments might we run to find out?"
Dave (5:37): Maybe can I ask you to follow up a little bit? Because we're talking a lot around the experimentation piece, and there's obviously a lot of history and depth to that we could go into, but maybe we could describe or define what an experiment is. So, Peter, what do you see? If I come to you with something and call it an experiment, what are you expecting?
Peter (5:59): Well, it depends on what we're experimenting with, of course, but let's take a really, really simple one related to ways of working. Let's say you've got an organization—and I'll pick something at the team level. An organization has decided they're going to go out and they're going to adopt Scrum, and they decide Scrum's got all of these different practices. One of those practices is daily stand-up. Everyone who's had anything to do with the Agile space will be familiar with this and loathe it, possibly, or love it, depending on how effective it's been for their ways of working, which is exactly what we're talking about.
Now, say your team's working there, they're doing daily stand-ups, and then they say, "Okay, if we're looking at this, is this actually serving us? Is this helping us?"
Peter (6:46): And from a coaching perspective, "Well, why don't we experiment with not doing this anymore? And let's see what happens. Let's see what happens to our metrics and the way that we're working together if we don't do this for a while, and see what happens from that." And we'll measure it based on how we're measuring our performance, how we're measuring communication. Are we still getting as much done? Are things falling through the cracks if we don't have this meeting?
And so we're experimenting. We've laid out what we're going to do. We set a hypothesis that, "Hey, we don't need this"—and I'm partially picking this one because we spoke about it a little bit on one of the previous podcasts—and we're laying that out and we're setting criteria to be able to measure whether or not we succeeded or failed, whether this was a good experiment or a bad experiment, and then from that we can make a decision about what we're going to do next.
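One lightweight way to picture what Peter describes here is to write the experiment down before running it: the hypothesis, the change being made, what will be measured, and the decision rule. The sketch below is purely illustrative; the field names and example values are hypothetical, not a template from the episode.

```python
# A hypothetical "experiment card" for the stand-up example: state the
# hypothesis, the single change, the measures, and the decision rule up front.
from dataclasses import dataclass, field

@dataclass
class ExperimentCard:
    hypothesis: str                                  # what we believe, stated before we start
    change: str                                      # the one thing we alter
    duration_weeks: int                              # how long we run it before deciding
    measures: list = field(default_factory=list)     # signals compared against the baseline
    decision_rule: str = ""                          # what result would change our behaviour

standup_experiment = ExperimentCard(
    hypothesis="Dropping the daily stand-up will not hurt delivery or communication",
    change="Pause the daily stand-up for the next sprint cycle",
    duration_weeks=4,
    measures=["throughput vs the previous four weeks",
              "items blocked for more than two days",
              "team communication survey score"],
    decision_rule="Reinstate the stand-up if throughput drops or blocked items increase",
)

print(standup_experiment)
```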
Dave (7:36): But I just want to... I don't think there is any such thing as a "good experiment" or a "bad experiment," and I think that's the thinking we want to try and challenge a little. Yes, the outcome of the experiment may prove a point, or prove that that wasn't the right way to go about it, or that there's more for us to learn, but I don't think that's tied to whether it's a good or a bad experiment. Every experiment, whatever the outcome, is going to be valuable. If anything makes an experiment good or bad, it's how well it's formed, and you described a really nice...
There are two things that we need to have working for us going in. One is a control. I need to compare it to the way we performed last week, or if I'm investing money in a different way, I'm going to invest money in something that's considered a safe asset as a control, to be able to see whether or not the other decisions I'm making are better or worse for me in the short or near term. So that control aspect is important.
But the other one that's ignored in many cases, as well as that control, is the hypothesis.
Dave (8:50): So I just had a conversation this week about a very trivial thing: placing a button on a website. And I remember having that conversation, and we were talking about, "Well, we should just go and A/B test it." So A/B testing the placement of a button on a website makes a lot of sense. But then I foolishly said the idea would be that we'd have a map of conversion, effectively, for the button wherever we placed it on the homepage. And I was just thinking out loud about whether that makes sense.
But the interesting thing there is that's not what we're going to do. Instead, we're going to go in and say, "No, hold on, it doesn't make sense for us to put the button on top of an image. It doesn't make sense for us to put the button, you know, right at the bottom of the page maybe." So we're looking at the model, the hypothesis we have, which is a button is going to work better maybe if it's near the navigation or if it's on the left-hand side or if it's in the focus.
So there's two or three things that we can look at already, and we can ignore a whole bunch of other areas as well. So that hypothesis that we come in, that model of behavior, is really important for us, coupled with some sort of control. I've got to be able to compare it with something to say was it better or worse than not running that test? If that makes sense, anything you would add to that, Peter?
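As a minimal sketch of the A/B comparison Dave describes, here is how the click-through rate of a few candidate placements might be compared against the current placement as a control. The placement names, visitor counts, and click counts are all made up for illustration.

```python
# Hypothetical results for three button placements; the current spot acts as the control.
variants = {
    "control_current_spot": {"visitors": 5000, "clicks": 400},
    "near_navigation":      {"visitors": 5000, "clicks": 455},
    "left_hand_side":       {"visitors": 5000, "clicks": 410},
}

def click_through_rate(stats):
    """Clicks divided by visitors for one variant."""
    return stats["clicks"] / stats["visitors"]

baseline = click_through_rate(variants["control_current_spot"])

for name, stats in variants.items():
    ctr = click_through_rate(stats)
    lift = (ctr - baseline) / baseline * 100
    print(f"{name:22s} CTR={ctr:.2%}  lift vs control={lift:+.1f}%")
```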
Peter (10:18): No, and I do agree with you—there's no bad experiments unless somebody's going to get physically harmed in the process, and that might possibly be bad. But then I suppose it depends on what you're testing then too. Yeah, so that piece around properly designing it, making sure that you have a well-designed, well-thought-out experiment, that the hypothesis is good or well-structured, that we know how we're going to measure what the outcome is and how we'll make a decision from that—those are kind of the main things that we would be looking for.
What do you think some of the main mistakes people make when putting experiments together are?
Dave (11:04): I think you touched on a really important one, which is thinking that it's a one-off, right? In too many cases, the only course of action we're going to pursue gets called an experiment, when really it isn't one; it's a sort of all-or-nothing dive for the finishing line. And so one of the challenges we have when we're working in different environments to run experiments—and when we talk about organizational change in particular, this is an example—is how do we get the cost of running a test or an experiment to be as small as possible?
Because if we look... I mentioned A/B testing earlier on. Nowadays there are tools, and Google has done a lot for us in terms of creating this environment where we can run tests very, very, very cheaply. And when we can run tests very cheaply, the conversation becomes "What's the next test we should run?" But when experiments are really expensive, we may call it an experiment, but in many cases we just have to run that one course of action and we don't have the opportunity to either change course or, perhaps more importantly, we don't have the opportunity to try different structures out, different options.
Dave (12:30): So if I come back to the organizational change or organizational structure example, very few organizations are going to be able to run in parallel two or three different organizational structures. It's just unfeasible. It's either very, very expensive or it's just... Unless you're incredibly large as an organization, it's really very difficult to do something like that. And in the same way, we can't run one for three months and go, "Yeah, you know that organizational structure doesn't work as well as we thought. Maybe we should run another." It's much more difficult and subtle than that.
So I think that I'd say, on the one hand, the thing that concerns me, if you like, is that we have to understand how to make the cost of experiments as small as possible, or at least recognize that the cost of running an experiment is going to influence our ability to successfully try different options.
Peter (13:29): Yeah, and I think that's a very good point, although I can think of some large organizations where restructuring every few months seems to be the norm, and it has exactly the effect you'd expect. There are some interesting ideas there, though. Design thinking, for example, has the concept of "we're going to run an experiment and we're going to prototype, but we're going to set a price limit on what it should cost to do that." So "I want you to design an experiment where you've got a hundred dollars, and I want you to prove out this idea."
Peter (14:02): And so, for example, if we're experimenting with a new product design and we want to see if people would be enticed by a particular type of product—say, I don't know, there was one the other day that makes your phone stick to the wall, or something like that. Design some leaflets, describe what it is, take them down to the street, stop some people, ask them and show them what it looks like. Or I might—in these days of COVID, maybe not—but I might create a little video, throw up a landing page and direct people to it and say, "Hey, if you had this, would you want it? Would this be useful to you?"
That kind of thing, which is a very cheap, very easy-to-run experiment to get quick feedback to learn about what people might want.
Dave (14:51): Yeah. So when you're doing that—and this brings us to the next piece—because you've got a bit of a control, like "I went out last week with a leaflet and I got a certain set of feedback. If I do the same this week, my hypothesis is I should get different feedback," whatever that might be. But that brings us back to the predictive nature of experiments and knowing when the experiment succeeded or failed, in the sense that it met our expectations or it didn't. And this means that a hypothesis on its own isn't any good. We have to have a hypothesis which makes a prediction.
Dave (15:31): So if I come back to that simple button on the web page, my hypothesis was that there are only three or four places on the web page where a button makes sense, and one of those will have a higher click-through. So if there's a preference, we should see more click-throughs when the button is in one position compared to another. That's the definition of an A/B test right there. So I can now look at the data that comes back. But there's another aspect: what if there are around 400, 450 clicks wherever the button is? Then there's no way of differentiating which one...
So we don't only have to make a prediction, we also have to understand that around that prediction there's room for error, and we need a result that is substantially better or substantially different across the options we're looking at.
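To make the "room for error" point concrete, here is a rough sketch of a two-proportion z-test using only the Python standard library. The 400 versus 450 click counts echo Dave's example, but the visitor totals are assumed purely for illustration.

```python
# Do roughly 400 vs 450 clicks reflect a real preference, or just noise?
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(clicks_a, visitors_a, clicks_b, visitors_b):
    """Return (z statistic, two-sided p-value) for the difference in click-through rates."""
    p_a = clicks_a / visitors_a
    p_b = clicks_b / visitors_b
    pooled = (clicks_a + clicks_b) / (visitors_a + visitors_b)
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Assumed 5000 visitors per variant; with these numbers p comes out around 0.07,
# so the two placements can't be confidently told apart at the usual 0.05 threshold.
z, p = two_proportion_z_test(400, 5000, 450, 5000)
print(f"z = {z:.2f}, p = {p:.3f}")
```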
Peter (16:29): Yeah, to be able to... And you see this a lot in the data space and the data science space too, where we're testing models: we have a hypothesis and we're testing it against data. Typically we take the data, split it, and keep a test set that we then run the model against to validate whether it works or not. And you've got to have enough data, and the data itself can be intrinsically biased in those cases. But that's going to take us down a whole other avenue of conversation, I think, and we're at 17 minutes now. So see, we did manage to talk about this quite easily for 15 to 20 minutes, as I thought we would. Excellent.
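As a minimal sketch of the train/test split Peter mentions, the snippet below holds back part of the data so the model is judged on examples it never trained on. It assumes scikit-learn is installed, and the bundled dataset and logistic regression model are arbitrary placeholders chosen only to show the idea.

```python
# Hold out a test set and compare accuracy on seen vs unseen data.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Keep 25% of the data aside; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

print(f"train accuracy: {model.score(X_train, y_train):.3f}")
print(f"test accuracy:  {model.score(X_test, y_test):.3f}")
```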
Dave (17:07): So maybe, Peter, if we summarize what we've talked about:
One aspect is experimentation is about learning. It's about that continual cycle of getting better at what we do. Number one.
Number two: experiments require a control of some sort, something to compare to so that we can decide if the behavior is better or worse than what we had as a default in the past.
Number three: it needs a hypothesis of whatever it is that we're testing.
And number four: some sort of appreciation of uncertainty and error in the results. So maybe we think of that as the hurdle which a test or an experiment has to overcome before we consider that we should therefore change our behavior as a result.
Would that be right? Anything you would add to that, Peter?
Peter (18:00): Only that there are no good or bad experiments. They're just experiments, things that we learn from, as we were saying. But yeah, I think that's a great summary. So, thank you very much. As always, I hope our listeners enjoy this too, and yeah, thank you—look forward to the next one.
Dave: Yeah, thanks again, Peter, always a pleasure.
Outro: You've been listening to Definitely Maybe Agile, the podcast where your hosts, Peter Madison and David Sharrock, focus on the art and science of digital, agile, and DevOps at scale.