Artificial intelligence (AI) is “at best” on a par with human experts when it comes to making medical diagnoses based on images, that a review has found.
The potential for artificial intelligence in healthcare has caused excitement, with advocates saying it will ease the strain on resources, free up time for doctor-patient interactions and even aid the development of tailored treatment. Last month the UK government announced £250m of funding for a new National Health Service (NHS) artificial intelligence laboratory.
The Guardian reports, however, that experts have warned the latest findings are based on a small number of studies, since the field is littered with poor-quality research.
The report says one burgeoning application is the use of AI in interpreting medical images – a field that relies on deep learning, a sophisticated form of machine learning in which a series of labelled images are fed into algorithms that pick out features within them and learn how to classify similar images. This approach has shown promise in diagnosis of diseases from cancers to eye conditions.
However, questions remain about how such deep learning systems measure up to human skills. Now researchers say they have conducted the first comprehensive review of published studies on the issue, and found humans and machines are on a par. Professor Alastair Denniston, at the University Hospitals Birmingham NHS foundation trust and a co-author of the study, said the results were encouraging but the study was a reality check for some of the hype about AI.
Dr Xiaoxuan Liu, the lead author of the study and from the same NHS trust, agreed. “There are a lot of headlines about AI outperforming humans, but our message is that it can at best be equivalent,” she said.
According to the report, Denniston, Liu and colleagues reported how they focused on research papers published since 2012 – a pivotal year for deep learning. An initial search turned up more than 20,000 relevant studies. However, only 14 studies – all based on human disease – reported good quality data, tested the deep learning system with images from a separate dataset to the one used to train it, and showed the same images to human experts.
The team pooled the most promising results from within each of the 14 studies to reveal that deep learning systems correctly detected a disease state 87% of the time – compared with 86% for healthcare professionals – and correctly gave the all-clear 93% of the time, compared with 91% for human experts.
However, the healthcare professionals in these scenarios were not given additional patient information they would have in the real world which could steer their diagnosis. Professor David Spiegelhalter, the chair of the Winton Centre for risk and evidence communication at the University of Cambridge, said in the report that the field was awash with poor research.
“This excellent review demonstrates that the massive hype over AI in medicine obscures the lamentable quality of almost all evaluation studies,” he said. “Deep learning can be a powerful and impressive technique, but clinicians and commissioners should be asking the crucial question: what does it actually add to clinical practice?”
However, the report says Denniston remained optimistic about the potential of AI in healthcare, saying such deep learning systems could act as a diagnostic tool and help tackle the backlog of scans and images. What’s more, said Liu, they could prove useful in places which lack experts to interpret images. Liu said it would be important to use deep learning systems in clinical trials to assess whether patient outcomes improved compared with current practices.
Dr Raj Jena, an oncologist at Addenbrooke’s Hospital in Cambridge who was not involved in the study, said deep learning systems would be important in the future, but stressed they needed robust real-world testing. He also said it was important to understand why such systems sometimes make the wrong assessment. “If you are a deep learning algorithm, when you fail you can often fail in a very unpredictable and spectacular way,” he said.
Background: Deep learning offers considerable promise for medical diagnostics. We aimed to evaluate the diagnostic accuracy of deep learning algorithms versus health-care professionals in classifying diseases using medical imaging.
Methods: In this systematic review and meta-analysis, we searched Ovid-MEDLINE, Embase, Science Citation Index, and Conference Proceedings Citation Index for studies published from Jan 1, 2012, to June 6, 2019. Studies comparing the diagnostic performance of deep learning models and health-care professionals based on medical imaging, for any disease, were included. We excluded studies that used medical waveform data graphics material or investigated the accuracy of image segmentation rather than disease classification. We extracted binary diagnostic accuracy data and constructed contingency tables to derive the outcomes of interest: sensitivity and specificity. Studies undertaking an out-of-sample external validation were included in a meta-analysis, using a unified hierarchical model. This study is registered with PROSPERO, CRD42018091176.
Findings: Our search identified 31 587 studies, of which 82 (describing 147 patient cohorts) were included. 69 studies provided enough data to construct contingency tables, enabling calculation of test accuracy, with sensitivity ranging from 9·7% to 100·0% (mean 79·1%, SD 0·2) and specificity ranging from 38·9% to 100·0% (mean 88·3%, SD 0·1). An out-of-sample external validation was done in 25 studies, of which 14 made the comparison between deep learning models and health-care professionals in the same sample. Comparison of the performance between health-care professionals in these 14 studies, when restricting the analysis to the contingency table for each study reporting the highest accuracy, found a pooled sensitivity of 87·0% (95% CI 83·0–90·2) for deep learning models and 86·4% (79·9–91·0) for health-care professionals, and a pooled specificity of 92·5% (95% CI 85·1–96·4) for deep learning models and 90·5% (80·6–95·7) for health-care professionals.
Interpretation: Our review found the diagnostic performance of deep learning models to be equivalent to that of health-care professionals. However, a major finding of the review is that few studies presented externally validated results or compared the performance of deep learning models and health-care professionals using the same sample. Additionally, poor reporting is prevalent in deep learning studies, which limits reliable interpretation of the reported diagnostic accuracy. New reporting standards that address specific challenges of deep learning could improve future studies, enabling greater confidence in the results of future evaluations of this promising technology.
Xiaoxuan Liu, Livia Faes, Aditya U Kale, Siegfried K Wagner, Dun Jack Fu, Alice Bruynseels, Thushika Mahendiran, Gabriella Moraes, Mohith Shamdas, Christoph Kern, Joseph R Ledsam, Martin K Schmid, Konstantinos Balaskas, Eric J Topol, Lucas M Bachmann, Pearse A Keane Alastair K Denniston