It is a great pleasure to be at this conference and to get the opportunity to say a few words to open it. I’ll come back to why evaluation is so important in a minute, but I want to start with a story.
Last week, I received a pre-publication sent to me by a scientist that showed that vitamin D administered to an otherwise healthy population did not reduce the chance of catching COVID. It did not reduce the severity of the disease and did not stop other acute respiratory tract infections. This piece of work was really important because, if I go back to the dark days of 2020, there was a strong belief that every single person in the UK should be supplementing with vitamin D. There were theoretical reasons to think that vitamin D might be beneficial in COVID. That call was resisted and, instead of going forward with something that we didn’t know the effect of, a proper study was set up, which has just read out.
That notion of not jumping into a policy choice without seeking some evidence base for it is obviously at the very heart of what this conference is. It has been incredibly important over the past couple of years during COVID. I’ve given you the vitamin D example, but there were many others where people had little bits of evidence or little bits of ideas that certain treatments would work and should be given quite widely across the population. Fortunately, a very big and well-run study was set up: the RECOVERY study, which looked at pretty much everybody coming into hospital and asked whether they could be randomised to different treatments, or looked at people before they were in hospital and asked the question “does something prevent or help?”.
We learnt quickly over the course of the pandemic which things did not work and, importantly, which ones did work. That was the reason that dexamethasone was selected early on as something that did actually have an effect; it reduced death and disability from severe COVID by about 25%. Because there was evidence, it was possible to then implement that very, very rapidly. So, it suddenly became the standard of care, not only in the UK but across the world. This idea that little bits of evidence can lead you in the wrong direction if you aren’t doing the right definitive study is absolutely well-established in that field.
By the way, everybody has been fooled. I challenge you to meet anyone across the space of trying to evaluate therapeutic interventions who hasn’t been fooled at some point by a series of seemingly very positive-looking results from little observations, little ideas, little things that people thought could work. Then when the real test is done, you find out “no, they don’t work” or even that they’re harmful.
A second example from the last couple of years is mass testing. There was huge enthusiasm for trying to get mass testing out, and a very important study took place in Liverpool where it was trialled across the population. What was important about that study was that not only did it show the benefits and difficulties of doing it, but it began to expose some of the unexpected, difficult operational matters. Mass testing wasn’t truly mass testing; there were parts of the population that it couldn’t reach. And there were some reasons that those people couldn’t be reached. They were cultural. They were demographic. Things like how close you were to the testing site became a really, really important determinant of whether this was going to be successful.
Masks are another example. We understand that they work but we don’t precisely know how well they work. Lockdowns are going to be studied for years to come. Which bits of lockdown worked? Why do they work? How did shielding operate and did it really protect the people it was supposed to protect? These are really critical issues which we will get answers to.
I wanted to start with those examples because they illustrate a number of general principles, which I think are important and relevant to other policy areas. How do we avoid causing harm? That’s one question we should ask ourselves. How might we accelerate doing good? The example there is the ability to get dexamethasone out quickly because we had an answer. How might we better understand the operational implications and how that might drive inequality or might reduce inequality or might advantage some groups? And that is the example of mass testing. And importantly, how do we also use evaluation to reduce uncertainty for the future? Wouldn’t it be nice to have gone into this pandemic understanding how best to operationalise lockdowns? Where they worked, where they didn’t work? What the effect would be on children, what the effect would be on the economy, what the effect would be on many other aspects of their implementation? Our duty now is to evaluate the policies that have taken place and really understand what that tells us for the future so that there is more of a guidebook for people to be able to use.
COVID has given us many examples of where evaluation has been and will be important and the Evaluation Task Force, when it was set up, had two aims: improve government understanding of what works and embed that in decision making.
You know, once it is said out loud, you have to ask yourself “what’s the opposite?”. If we don’t do that, what does it mean? I think it means choosing not to know. It means you go into something saying, “it’s OK, I don’t want to know”. And that can’t be right. Evaluation must be part of everyday thinking and must be part of our duties, as civil servants and public servants to try to make sure what we do is in the best interest of those people that we serve. This conference, therefore, is incredibly important. It’s very timely because this is totally aligned with the government reform agenda. And, for all the reasons that I’ve said, it is essential. It’s not something that, to my mind, is optional.
Where is the starting point? The starting point needs to be accepting that we don’t know and being comfortable with the fact that our starting point is very often one that contains big gaps in knowledge. Then the first step needs to be: how do I turn that gap into a definable gap? And that is where evidence synthesis becomes important. A key step in this whole process is getting an adequate view of what we really know and don’t know, and evidence synthesis is an established methodology to do that. There are some principles that are important to adhere to. It needs to be rigorous. It needs to be transparent. You need to actually be able to articulate how you’ve gone about doing the evidence synthesis, what methodology you’ve used, what you’ve included, what you’ve not included. It needs to be inclusive because, very often, these questions are not sitting in a single sphere. They may cross disciplines, they may cross policy areas, they may cross groups. We need to make sure that we include the right groups in order to get the right information. And it needs to be accessible. People need to be able to get the outcome of this and see the outcome so they can both adopt it and challenge it.
If step one is evidence synthesis, then step two clearly is going to be the design of whatever policy it is that needs to be designed. But the design can seldom be definitive without an evaluation step. We need to evaluate, and then we need to amend. That process of being flexible and being prepared to adjust as you go along has an analogy in the clinical trials world: the so-called adaptive design, where you know you don’t know all the answers at the beginning, you want to collect as much information as you can and modify what you do as you go forward. There are a few things that I’ve learnt over the years about how you think about trying to evaluate. The first, and it’s sort of bleeding obvious, is: why don’t we just think about the outcomes? What is it that you’re really trying to achieve? What is the outcome that matters and to whom does that matter?
One of the things that we all need to do is to think about surrogate outcomes because, very often, the policy outcome may be many, many years off and you can’t always wait for the many years to get the answers. So, are there surrogates of the true outcome that reflect accurately the true outcome? Very much something to think about, but also something that is difficult to actually do, and sometimes it can mislead you as sometimes the surrogates aren’t good surrogates for the real outcome.
Second, when thinking about outcomes, which groups or populations does the outcome matter to? If you think about the mass testing example, it mattered to the population. It was a population-level outcome, but it mattered an awful lot to small groups and to individuals as well. So what does matter in your policy? Is it the total population average or is it, in addition to that, something you want to know about a subgroup or individuals? There are various methodologies to look at those. The individual, of course, is ultimately what does matter to us. You know we care about what happens to us, so I think that helps the notion of how you get into evaluation, not just at a population average level.
There’s also the question of how definitive you need your answer to be. Considering the vitamin D example I gave before, you could argue that if something is a completely harmless policy intervention, then its benefits might outweigh its risk whatever happens, and you don’t necessarily need the most massive definitive answer to your question, but you need enough of a guide. In other situations you might say: actually, there are unintended consequences here which could be really quite big, and I’d like a much more definitive answer to this. So understanding how definitive your answer needs to be determines how you choose to evaluate.
Speed is always the issue, and in fact it’s usually the reason for not doing something. I want to know now and, therefore, I’m not going to evaluate – that I think is a big mistake. There are ways of getting quick answers. There are ways of looking at things as you go along. Don’t let speed be the enemy of this.
Bias is a big problem. You know, when you look at how results come about, you’ve got to really build in the fact that your intervention itself is changing what you’re looking at. There is your bias as a designer or interpreter, and there is the bias that’s inherent because people start to do things differently, and they may do things differently depending on how you set things up. Bias is important throughout all of this.
I’d like to leave one message in this. If you do evaluation openly and you’re clear with people, it is a massive builder of trust. One of the reasons, over the past two years, certain things have been possible that would have been very difficult to do otherwise is because of trust building. Evaluation is a key part of that.
As we think about evaluation, I’ve used some examples from the medical sphere. Let me move away from the medical sphere: Net Zero. We’ve got the most massive societal challenge around Net Zero, with many, many moving parts that are interrelated. Something that happens in one place may dramatically affect something that happens somewhere else. If we don’t evaluate that as we go along, we will end up making big mistakes and they will have knock-on consequences for other areas. One example of an evaluation tool, which I think is important, comes from the Office for National Statistics, who have now set up a brilliant dashboard to look at all the outcomes for Net Zero. You might want to take a look at it, so you can begin to have an integrated series of outcome measures as you begin to think about policy development in those areas.
Evaluation is going to be important. Are we really achieving what we intend to achieve and who are we achieving it for? Is it the whole population? Is it part of the population? Is it individuals? We’re going to have to think about how to evaluate.
I want to end by giving an example from a colleague, the Chief Scientific Adviser in Japan, who told me about something that he’s been doing for the past several years. With the Treasury in Japan, they decided at their annual budget review that they would no longer accept any proposals for spending if they didn’t include a very clear, outcome-driven evaluation process. This was implemented a few years ago in some departments and they evaluated the effect of doing that. They found that the overall spend on evaluation, of course, went up. It went up a few percent. The overall spend in departments went down because they found out they were able to target their spend more effectively as a result of the evaluation. We do have some international comparisons telling us that evaluation does make a big difference in terms of the outcome you’re trying to get. They don’t yet know in the Japanese example if they have overall better outcomes but it’s pointed in the right direction and intuitively you would think that would make sense.
If we get this right, this absolutely changes the way we think about policy development. But it has to be done in a way that’s iterative, agile, and appropriately targeted towards the outcomes, and framing those outcomes is a key part of the process.