My suspicion of AI in healthcare and everywhere else.

AI – it’s everywhere: there every time a politician pronounces on how to transform productivity in every industry, healthcare included; there each time you open a newspaper or watch TV; there in conversations over coffee, in advertising and in culture. AI, however ambiguously defined, is the new ‘white heat of technology’.

In her excellent book ‘Artificial Intelligence: A Guide for Thinking Humans’, Melanie Mitchell discusses the cycles of AI enthusiasm, from gushing AI boosterism to disappointment, rationalisation or steady and considered incorporation. She likens this cycle to the passing of the seasons – an AI spring followed by an inevitable AI winter. The recent successes of AI, and in particular the rapid development of large language models like ChatGPT, have resulted in a sustained period of AI spring, with increasingly ambitious claims made for the technology, fuelled by the hubris of the ‘Bitter Lesson’ – that any human problem might be solvable not by thought, imagination, innovation or collaboration but simply by throwing enough computing power at it.

These seem exaggerated claims. Like many technologies, AI may be excellent for some things and not so good for others, and we have not yet learned to tell the difference. Most human problems come with a panoply of complexities that prevent wholly rational solutions. Personal (or corporate) values, prejudices, experience, intuition, emotion, playfulness and a whole host of other intangible human traits factor into their management. For example, AI is great at transcribing speech (voice recognition), but understanding spoken meaning is an altogether different problem, laden with glorious human ambiguity. When a UK English speaker says “not bad”, that can mean anything from amazing to deeply disappointing.

In our work as radiologists we live this issue of problem misappropriation every day. We understand there is a world of difference between the simple question ‘what’s on this scan?’ and the much more challenging ‘which of the multiple findings on this scan is relevant to my patient in the context of their clinical presentation, and what does this mean for their care?’. That’s why we call ourselves Clinical Radiologists, and why we have multidisciplinary team (MDT) meetings. Again, what seems like a simple problem may be, in fact, hugely complex. To suggest (as some have) that certain professions will be rendered obsolete by AI is to utterly misunderstand those professions, and the nature of the problems their human practitioners apply themselves to.

Why do we struggle to separate AI reality from hubristic overreach? Partly this is due to inevitable marketing and investor hype, but I also think the influence of literature and popular culture has an important role. Manufactured sentient agents are a common fictional device: from Frankenstein’s Monster via HAL 9000 to the Cyberdyne T-800 or Ash of modern science fiction. But we speak about actual AI using the same language as we do about these fictional characters (and they are characters – that’s the point), imbuing it with anthropomorphic talents and motivations that are far divorced from today’s reality. We describe it as learning, as knowing, but we have no idea what this means. We are beguiled by its ability to mimic our language but don’t question the underlying thought. In short, we think of AI systems more like people than like tools limited in purpose and role. To steal a quote, we forget that these systems know everything about what they know, and nothing about anything else (there it is again: ‘know’?). Because we can solve complex problems, we think AI can, and in the same way.

Here’s an example. In studies of AI image interpretation, neural networks ‘learn’ from a ‘training’ dataset. Is this training and learning in the way we understand it? 

Think about how you train a radiologist to interpret a chest radiograph. After embedding the routine habit of demographic checking, you teach the principles of x-ray absorption in different tissues, then move on to helping them understand the silhouette sign and how the image findings fall inevitably, even beautifully, from the pathological processes present in the patient. It’s true that over time, with enough experience, a radiologist develops ‘gestalt’ or pattern recognition, meaning they don’t have to follow each of the steps to compose the report; they just ‘know’. But occasionally gestalt fails and they need to fall back on first principles.

What we do not do is give a trainee 100,000 CXRs, each tagged with the diagnosis, and ask them to make up their own scheme for interpreting them. Yet this is exactly how we train an AI system: we give it a stack of labelled data and away it goes. There is no pedagogy, mentoring, understanding, explanation or derivation of first principles. There is merely the development of a statistical model in the hidden layers of the software’s neural network, which may or may not produce the same output as the human. Is this learning?
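To make the contrast concrete, here is a minimal, purely illustrative sketch of that training recipe: a stack of synthetic ‘images’, each tagged with a label, handed to a small neural network that simply adjusts the weights in its hidden layers until its outputs match the tags. The data, labels, array sizes and network shape are my own assumptions for illustration; this is not a real chest radiograph system.

```python
# Illustrative sketch only: synthetic data standing in for 'labelled radiographs'.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Pretend each 'radiograph' is a 32x32 image flattened to 1,024 pixel values,
# tagged simply 'normal' (0) or 'abnormal' (1) - no silhouette sign, no physics.
n_images = 2000
X = rng.normal(size=(n_images, 32 * 32))
y = (X[:, :50].mean(axis=1) > 0).astype(int)  # an arbitrary hidden rule standing in for 'the diagnosis'

# 'Training' is nothing more than adjusting the weights of the hidden layers
# until the outputs match the supplied tags on the training stack.
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
net.fit(X, y)

print("agreement with the tags on the training stack:", round(net.score(X, y), 2))
# Whatever internal rule the network has settled on was never explained, mentored
# or derived from first principles - it is simply whatever best fitted the labels.
```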

In her book, Mitchell provides some examples of how an AI’s learning differs from (and, I would say, is inferior to) human understanding. She describes ‘adversarial attacks’, where the output from a system designed to interpret an image can be rendered wholly inaccurate by altering a single pixel within it, a change invisible to a human observer. More illustratively, she describes a system designed to identify whether an image contained a bird, trained on a vast number of images containing, and not containing, birds. But what the system actually ‘learned’ was not to identify a feathered animal but to identify a blurred background. It turns out that most photos of birds are taken with a long lens, a shallow depth of field and a strong bokeh, so the system associated the bokeh with the tag ‘bird’. Why wouldn’t it, without the helping hand of a parent, a teacher or a guide to point out its mistake?
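For a feel of how brittle such a fitted model can be, the rough sketch below searches an image, one pixel at a time, for a small change that flips a classifier’s output. The classifier and data are toy stand-ins of my own devising, not the specific system or attack Mitchell reports.

```python
# Toy sketch of a single-pixel probe; everything here is synthetic and illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Tiny synthetic 'images': 8x8 pixels flattened to 64 values, labelled by an arbitrary rule.
X = rng.normal(size=(500, 64))
y = (X[:, :8].sum(axis=1) > 0).astype(int)
clf = LogisticRegression(max_iter=1000).fit(X, y)


def find_single_pixel_flip(model, images, delta=1.0):
    """Return the first (image index, pixel index, change) whose single-pixel
    nudge flips the model's prediction, or None if no flip is found."""
    for idx, image in enumerate(images):
        original = model.predict(image.reshape(1, -1))[0]
        for pixel in range(image.size):
            for change in (-delta, delta):
                probe = image.copy()
                probe[pixel] += change
                if model.predict(probe.reshape(1, -1))[0] != original:
                    return idx, pixel, change
    return None


print(find_single_pixel_flip(clf, X))
# For an image near the decision boundary, a change to a single pixel - trivial to a
# human eye - is enough to change the 'diagnosis'. The model has no concept of 'bird'
# or 'abnormality' to fall back on; it only has the statistical surface it fitted.
```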

Is a machine developed in this way actually learning, in the way we use the term? I’d argue it isn’t, and to suggest so implies much more than the system calibration that is actually going on. Would you expect the same from a ‘self-calibrating neural network’ as from a ‘learning machine’? Language matters: using less anthropomorphic terms allows us to think of AI systems as tools, not as entities.

We are used to deciding the best tool for a given purpose. Considering AI more instrumentally, as a tool, allows us the space to articulate more clearly what problem we want to solve, where an AI system would usefully be deployed and what other options might be available. For example, improving the immediate interpretation of CXRs by patient-facing (non-radiology) clinicians might be best served by an AI support tool, an education programme, a brief induction refresher, increases in reporting capacity, or all four. Which of those should a department invest in? Framing the question in this way at least encourages us to consider all alternatives, human and machine, and to weigh up the governance and economic risks of each more objectively. How often does that assessment happen? I’d venture, rarely. Rather, the technocratic allure of the new toy wins out and alternatives are either ignored or, at best, incompletely explored.

So to me, AI is a tool, like any other. My suspicion of it derives from my observation that what is promised for AI goes way beyond what is likely to be deliverable, that our language about it inappropriately imbues it with human traits, and that it crowds out human solutions which are rarely given equal consideration.

Melanie Mitchell concludes her book with a simple example, a question so basic that it seems laughable. What does ‘it’ refer to in the following sentence:

The table won’t fit through the door: it is too big.

AI struggles with questions like this. We can answer it because we know what a table is, and what a door is: concepts derived from our lived experience and our labelling of that experience. We know that doors don’t go through tables, but that tables may sometimes be carried through doors. This knowledge is not predicated on assessment of a thousand billion sentences containing the words ‘door’ and ‘table’, and the likelihood of the words appearing in a certain order. It’s based in what we term ‘common sense’.

No matter how reductionist your view of the mind as a product of the human brain, to reduce intelligence to a mere function of the number of achievable teraflops ignores that past experience of the world, nurture, relationships, personality and many other traits legitimately shape our common sense, thinking, decision making and problem solving. AI systems are remarkable achievements, but there is a way to go before I’ll lift my scepticism of their role as anything other than a tool to be deployed judiciously and alongside other, human, solutions.

Parachutes, belief and intellectual curiosity

“There is no evidence that jumping out of a plane with a parachute improves outcome”

“If you go to PubMed, you will find no publications that breathing air is good for you”

“There’ll never be a trial, we are beyond that”

Have you ever heard these statements made when someone discusses the evidence about a particular new (or old) therapy? The statements might be true but are they useful? Do they advance an argument? What do they mean?

A paper from the 2018 Christmas edition of the British Medical Journal found no evidence of benefit from parachute use when people jumping out of an aeroplane (which happened to be stationary at ground level) were randomised to wearing either a parachute or an empty North Face rucksack. This evidence built on a 2003 systematic review which found that there was no randomised evidence on the usefulness of parachutes for high-altitude exits. Both these articles have a somewhat tongue-in-cheek style, but make the point that “…under exceptional circumstances, common sense must be applied when considering the potential risks and benefits of interventions”.

It is self evident that wearing a parachute when jumping out of a plane in flight, or being in an atmosphere with enough air to breathe, is good for you. When people quote arguments about parachutes or air (or similar) in response to a query about a lack of evidence for a particular intervention, they are implying that the intervention they are discussing is similarly self evidently safe, effective or cost effective, and that common sense must be applied.

The issue is that the benefits of most medical interventions are clearly not in this category. To give some examples from my own field, it is not self evident that dosimetric methods will improve the outcomes of selective internal radiation therapy sufficiently to make a difference to trial outcomes, that endovascular intervention for acute deep vein thrombosis improves long term outcomes compared with anticoagulation, that for complex aneurysms endovascular aneurysm repair is better than open surgery or conservative management… I could go on.

And here we come to the crux of the matter, which is that such comments add nothing to a discussion about an intervention’s evidence base. Rather, their effect is to stifle debate into a confused silence. Whether this is done intentionally or out of embarrassment is irrelevant; the effect is the same: intellectual curiosity is suppressed and questioning is discouraged. This is the opposite of the empiricism that underpins the whole of Western scientific thought. Before people asked questions, it was self evident that the Earth was flat, that it was the centre of the universe and that it was orbited by the sun. That was just common sense.

A strategy related to appeals to common sense is the weaponisation of the weight of collective opinion. Clinical trial design is dependent on equipoise, meaning clinicians do not know which of several options is better. Equipoise is dependent on opinion, and opinion is swayed by much more than evidence. Medical professionals are just as receptive to marketing, advertising, fashion and halo bias as anyone. Nihilistic statements denying a trial is possible (or even desirable) on the grounds that an intervention has become too popular or culturally embedded are only true if we allow them to be. The role of senior ‘key opinion leaders’ is critical here: they have a responsibility to openly question the status quo, to use their experience to identify and highlight the holes in the evidence, to point out the ‘elephant in the room’. But too often these leaders (supported in some cases by professional bodies and societies) become a mouthpiece for industry and vested interest, promoting dubious evidence, suppressing debate and inhibiting intellectual curiosity. There are notable examples of trials overcoming the hurdle of entrenched clinical practice and assessing deeply embedded cultural norms. This requires committed leaders who create a culture where doubt, equipoise and enquiry can flourish.

Given the rapid pace of technological development in modern healthcare, it is not unreasonable to have an opinion about an intervention that is not backed by the evidence of multiple congruent randomised controlled trials. But this opinion must be bounded by a realistic uncertainty. A better word for this state of mind is a reckoning. To reckon allows for doubt. Instead, when an opinion becomes a belief, doubt is squeezed out. ‘Can this be true?’ becomes ‘I want this to be true’, then ‘it is true’ and ultimately ‘it is self evidently true’. Belief becomes orthodoxy, and questioning becomes heresy and is actively (or passive-aggressively) suppressed.

Karl Popper’s theory of empirical falsification states that a theory is only scientifically valid if it is falsifiable. In his book on assessing often incomplete espionage intelligence, David Omand (former head of the UK electronic intelligence, security and cyber agency, GCHQ) comments that the best theory is the one with the least evidence against it. A powerful question is therefore not “what evidence do I need to demonstrate that this view of the world is right?” but its opposite: “what evidence would I need to demonstrate that this view of the world is wrong?”. Before the second Gulf War in 2002-3, an important question was whether Iraq had an ongoing chemical weapons programme. As we all know, no evidence was found (before or after the invasion). The theory with the least evidence against it is that Iraq had, indeed, destroyed its chemical weapons stockpile. More prosaically, that all swans are white is self evident until you observe a single black one.

If someone is so sure that an intervention is self evidently effective, proposing an experimental design to test this should be welcomed, not seen as a threat. But belief (as opposed to a reckoning) is tied up in identity, self worth and professional pride. What, then, does an impassioned advocate of a particular technique have to gain from an honest answer to the question “what evidence would it take for you to abandon this intervention as ineffective?”, if that evidence is then produced?

Research is hard. Even before the tricky task of patient recruitment begins, a team with complementary skills in trial design, statistics, decision making, patient involvement, data science and many other areas must be assembled. Funding and time must be identified. Colleagues must be persuaded that the research question is important enough to be prioritised amongst their other commitments. This process is time consuming, expensive and often results in failure, as my fruitless attempts at getting funding from the National Institute for Health and Care Research for studies on abdominal aortic aneurysm attest. But this is not to say that we should not try. We are lucky in medicine that many of the research questions we face are solvable by the tools we have at our disposal, if only we could deploy them rapidly and at scale. Unlike climate scientists, we can design experiments to test our hypotheses. We do not have to rely on observational data alone.

The 2018 study on parachute use is often cited as a criticism of evidence based medicine. That a trial can produce such a bizarre result is extrapolated to infer that all trials are flawed (especially if they do not produce the desired result). My reading of the paper is that the authors have little sympathy for these arguments. After discussing the criticisms levelled at randomised trials they write with masterly understatement “It will be up to the reader to determine the relevance of these findings in the real world” and that the “…accurate interpretation [of a trial] requires more than a cursory reading of the abstract.”

As I wander past the device manufacturers at medical conferences I wonder: if more of the resource used to fund the glossy stands, baristas and masseuses were channelled into rigorous and independent research, generating the evidence to support what we do would be so much easier. And I wonder why we tolerate a professional culture that so embraces orthodoxy, finds excuses not to undertake rigorous assessments of the new (and less new) interventions we perform, and is happy to allow glib statements about trial desirability, feasibility and generalisability, about parachutes and air, to go unchallenged.

Registry Data and the Emperor’s New Clothes

Registries. They’re a big thing in interventional radiology. Go to a conference and you’ll see multiple presentations describing a new device or technique as ‘safe and effective’ on the basis of ‘analysis of prospectively collected data’. National organisations (e.g. the Healthcare Quality Improvement Partnership [HQIP] and the National Institute for Health and Care Excellence), professional societies (like the British Society of Interventional Radiology) and the medical device industry promote them, often enthusiastically.

The IDEAL collaboration is an organisation dedicated to quality improvement in research into surgery, interventional procedures and devices. It has recently updated its comprehensive framework for the evaluation of surgical and device based therapeutic interventions. The value of comprehensive data collection within registries is emphasised in this framework at all stages of development, from translational research to post-market surveillance.

Baroness Cumberlege’s report into failures in the long-term monitoring of new devices, techniques and drugs identified this lack of vigilance as contributing to a system that is not safe enough for those being treated using these innovations. She recommended that a central database be created for implanted devices, for research and audit into their long-term outcomes.

This is all eminently sensible. Registries, when properly designed and funded and with a clear purpose and goal, are powerful tools for generating information about the interventions we perform. But I feel very uneasy about many registries because they often have unclear purpose, are poorly designed and are inadequately funded. At best they create data without information. At worst they cause harm by obscuring reality or suppressing more appropriate forms of assessment.

A clear understanding of the purpose of a registry is crucial to its design. Registries work best as tools to assess safety. In a crowded and expensive healthcare economy, this is an insufficient metric by which to judge a new procedure or device. Evidence of effectiveness relative to alternatives is crucial. If the purpose of a registry is to make some assessment of effectiveness, its design needs to reflect this.

The gold standard tool for assessing effectiveness is the randomised controlled trial [RCT]. These are expensive, time-consuming, and complex to set up and coordinate. As an alternative, a registry recruiting on the basis of a specific diagnosis (equivalent to RCT inclusion criteria) is ethically simpler and frequently cheaper to instigate. While still subject to selection bias, a registry recruiting on this basis can provide data on the relative effectiveness of the various interventions (or no intervention) offered to patients with that diagnosis. The registry data supports shared decision making by providing at least some data about all the options available. 

Unfortunately, most current UK and international interventional registries use the undertaking of the intervention (rather than the patient’s diagnosis) as the criterion for entry. The lack of data collection about patients who are in some way unsuitable for the intervention or opt for an alternative (such as conservative management) introduces insurmountable inclusion bias and prevents the reporting of effectiveness and cost-effectiveness compared with alternatives. The alternatives are simply ignored (or assumed to be inferior) and safety is blithely equated with effectiveness without justification or explanation. Such registries are philosophically anchored to the interests of the clinician (interested in the intervention) rather than to those of the patient (with an interest in their disease). They are useless for shared decision making. 

This philosophical anchoring is also evident in choices about registry outcome measures, which are frequently those easiest to collect rather than those which matter most to patients: a perfect example of the McNamara (quantitative) fallacy. How often are patients involved in registry design at the outset? How often are outcome metrics relevant to them included, rather than surrogate endpoints of importance to clinicians and device manufacturers?

Even registries where the ambition is limited to post-intervention safety assessment or outcome prediction, and where appropriate endpoints are chosen, are frequently limited by methodological flaws. A lack of adequate statistical planning at the outset, and the collection of multiple baseline variables without consideration of the number of outcome events needed to allow modelling, risk overfitting and shrinkage – fundamental statistical errors.
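As a hedged illustration of that last point, the sketch below (my own toy example, with hypothetical variable names and sizes) fits an effectively unpenalised logistic regression outcome model to registry-like data with around thirty baseline variables but only a handful of outcome events, and compares the apparent, in-sample discrimination with performance in new patients. The gap between the two is the overfitting, and the need for shrinkage, referred to above.

```python
# Simulated registry data; numbers and names are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

n_registrants, n_baseline_vars = 300, 30      # roughly one outcome event per variable - far too few
X = rng.normal(size=(n_registrants, n_baseline_vars))
# The true outcome depends on only two baseline variables; the rest are noise.
true_logit = -2.2 + 0.8 * X[:, 0] + 0.5 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# New patients from the same population, standing in for external validation.
X_new = rng.normal(size=(5000, n_baseline_vars))
new_logit = -2.2 + 0.8 * X_new[:, 0] + 0.5 * X_new[:, 1]
y_new = rng.binomial(1, 1 / (1 + np.exp(-new_logit)))

# An (effectively) unpenalised logistic regression, as often fitted to registry data.
model = LogisticRegression(C=1e6, max_iter=5000).fit(X, y)

print("outcome events in the registry:", int(y.sum()))
print("apparent AUC on the registry data:", round(roc_auc_score(y, model.predict_proba(X)[:, 1]), 2))
print("AUC in new patients:", round(roc_auc_score(y_new, model.predict_proba(X_new)[:, 1]), 2))
# The apparent AUC is flattered by overfitting; performance in new patients is markedly
# lower, and the fitted coefficients would need shrinkage before being used elsewhere.
```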

Systematic inclusion of ‘all comers’ is rare, but failure to include all patients undergoing a procedure introduces ascertainment bias. Global registries often recruit apparently impressive numbers of patients, but scratch the surface and you find rates of recruitment that suggest a majority of patients were excluded. Why? Why include one intervention or patient but not another? Such recruitment problems also affect RCTs, resulting in criticisms about ‘generalisability’ or real-world relevance, but it’s uncommon to see such criticism levelled at registry data, especially when it supports pre-existing beliefs or procedural enthusiasm, or endorses a product marketing agenda.

Finally there is the issue of funding. Whether the burden of funding and transacting post-market surveillance should fall primarily on professional bodies, the government or the medical device companies that profit from the sale of their products is a subject for legitimate debate, but in the meantime registry funding rarely includes provision for the systematic longitudinal collation of long-term outcome data from all registrants. Pressured clinicians and nursing staff cannot prioritise data collection without the time or funding to do so. Rather, the assumption is (for example) that the absence of notification of an adverse outcome automatically represents a positive one. Registry long-term outcome data is therefore frequently inadequate. While potential solutions such as linkage to routinely collected datasets and other ‘big data’ initiatives are attractive, these data are often generic and rarely patient focussed. The information governance and privacy obstacles to linking such sensitive information are substantial.

Where does this depressing analysis leave us?

Innovative modern trial methodologies (such as cluster, preference, stepped-wedge, trial-within-cohort or adaptive trials) provide affordable, robust, pragmatic and scalable alternative options for the evaluation of novel interventions, and are deliverable within an NHS environment, though registries are still likely to have an important role to play. HQIP’s ‘Proposal for a Medical Device Registry’ defines key principles for registry development, including patient and clinician inclusivity and ease of routine data collection using electronic systems. Where these principles are adhered to, and where registries are conceived and designed around a predefined, specific hypothesis or purpose, are based on appropriate statistical methodology with relevant outcome measures, are coordinated by staff with the necessary skillsets to manage site, funding and regulatory aspects, and are budgeted to ensure successful data collection and analysis, they can be powerful sources of information about practice and novel technologies. This is a high bar, but it is achievable, as the use of registry data during the COVID-19 pandemic has highlighted. Much effort is being expended on key national registries (such as the National Vascular Registry) to improve the quality and comprehensiveness of the data collected and to create links to other datasets.

But where these ambitions are not achieved we must remain highly sceptical about any evidence registry data purports to present. Fundamentally, unclear registry purpose, poor design and inadequate funding will guarantee both garbage in and garbage out.

Registry data is everywhere. Like the emperor’s new clothes, is it something you accept at face value, uncritically, because everyone else does? Do you dismiss the implications of registry design if the data interpretation matches your prejudice? Instead perhaps, next time you read a paper reporting registry data or are at a conference listening to a presentation about a ‘single arm trial’, be like the child in the story and puncture the fallacy. Ask whether there is any meaningful information left once the biases inherent in the design are stripped away.