
AI chatbots outstrip doctors in diagnoses – US randomised study

A small study found that ChatGPT outdid human physicians at assessing medical case histories, and suggested that although doctors are being exposed to artificial intelligence tools in their work, few know how to exploit the abilities of chatbots.

Dr Adam Rodman, an expert in internal medicine at Beth Israel Deaconess Medical Centre in Boston, had confidently expected that chatbots built to use artificial intelligence would help doctors diagnose illnesses.

He was wrong, reports The New York Times.

Instead, in a study he helped design, doctors who were given ChatGPT-4 along with conventional resources did only slightly better than doctors who did not have access to the bot. And, to the researchers’ surprise, ChatGPT alone outperformed the doctors.

“I was shocked,” Rodman said.

The chatbot, from the company OpenAI, scored an average of 90% when diagnosing a medical condition from a case report and explaining its reasoning. Doctors randomly assigned to use the chatbot got an average score of 76%. Those randomly assigned not to use it had an average score of 74%.

The study showed more than just the chatbot’s superior performance.

It revealed doctors’ sometimes unwavering belief in a diagnosis they had made, even when a chatbot suggested a potentially better one.

The study also illustrated that, despite being exposed to AI in their work, few doctors know how to take advantage of chatbots’ abilities. As a result, they failed to exploit the systems’ ability to solve complex diagnostic problems and offer explanations for their diagnoses.

AI systems should be “doctor extenders”, Rodman said, offering valuable second opinions on diagnoses.

But it seems there is a way to go before that potential is realised.

Case history, case future

The experiment involved 50 doctors, a mix of residents and attending physicians recruited through a few large American hospital systems, and was published in JAMA Network Open.

The test subjects were given six case histories and were graded on their ability to suggest diagnoses and explain why they favoured or ruled them out. Their grades also included getting the final diagnosis right.

The graders were medical experts who saw only the participants’ answers, without knowing whether they were from a doctor with ChatGPT, a doctor without it or from ChatGPT by itself.

The case histories used in the study were based on real patients and are part of a set of 105 cases that has been used by researchers since the 1990s.

The cases have intentionally never been published, so that medical students and others could be tested on them without any foreknowledge. That also meant that ChatGPT could not have been trained on them.

But, to illustrate what the study involved, the investigators published one of the six cases on which the doctors were tested, along with answers to the test questions on that case from a doctor who scored high and from one whose score was low.

That test case involved a 76-year-old patient with severe pain in his low back, buttocks and calves when he walked. The pain started a few days after he had been treated with balloon angioplasty to widen a coronary artery.

He had been treated with the blood thinner heparin for 48 hours after the procedure.

The man complained that he felt feverish and tired. His cardiologist had done lab studies that indicated a new onset of anaemia and a build-up of nitrogen and other kidney waste products in his blood.

The man had undergone bypass surgery for heart disease a decade earlier.

The case vignette continued with details of the man’s physical exam and then provided his lab test results.

The correct diagnosis was cholesterol embolism – a condition in which shards of cholesterol break off from plaque in arteries and block blood vessels.

Participants were asked for three possible diagnoses, with supporting evidence for each. They were also asked to provide, for each possible diagnosis, findings that do not support it or that were expected but not present.

Additionally, participants were asked to provide a final diagnosis.

Then they were to name up to three additional steps they would take in their diagnostic process.
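
To make the answer format concrete, here is a minimal sketch in Python of how one participant’s response could be structured; the class and field names are illustrative assumptions, not part of the study’s actual grading materials.

```python
# Illustrative only: a hypothetical structure mirroring the answer format described
# above, not the investigators' actual rubric or grading software.
from dataclasses import dataclass
from typing import List

@dataclass
class CandidateDiagnosis:
    name: str
    supporting_findings: List[str]            # vignette findings that favour this diagnosis
    opposing_or_missing_findings: List[str]   # findings that argue against it, or were expected but absent

@dataclass
class CaseResponse:
    differential: List[CandidateDiagnosis]    # up to three possible diagnoses
    final_diagnosis: str
    next_steps: List[str]                     # up to three further diagnostic steps

# Partial example drawn from the published case, using only details given in the vignette above.
response = CaseResponse(
    differential=[
        CandidateDiagnosis(
            name="Cholesterol embolism",
            supporting_findings=[
                "recent balloon angioplasty",
                "new-onset anaemia",
                "build-up of nitrogen and other kidney waste products",
            ],
            opposing_or_missing_findings=[],
        ),
    ],
    final_diagnosis="Cholesterol embolism",
    next_steps=[],  # participants would name up to three further diagnostic steps here
)
```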

Like the diagnosis for the published case, the diagnoses for the other five cases in the study were not easy to figure out. But neither were they so rare as to be almost unheard-of. Yet the doctors, on average, did worse than the chatbot.

What, the researchers asked, was going on?

The answer seems to hinge on questions of how doctors settle on a diagnosis, and how they use a tool like artificial intelligence.

The physician in the machine

How, then, do doctors diagnose patients?

The problem, said Dr Andrew Lea, a historian of medicine at Brigham and Women’s Hospital who was not involved with the study, is that “we really don’t know how doctors think”.

In describing how they came up with a diagnosis, doctors would say, “intuition”, or, “based on my experience”, Lea said.

That sort of vagueness has challenged researchers for decades as they have tried to make computer programs that can think like a doctor.

The quest began almost 70 years ago.

“Ever since there were computers, there were people trying to use them to make diagnoses,” Lea added.

One of the most ambitious attempts began in the 1970s at the University of Pittsburgh. Computer scientists there recruited Dr Jack Myers, chairman of the medical school’s department of internal medicine, who was known as a master diagnostician. He had a photographic memory and spent 20 hours a week in the medical library, trying to learn everything that was known in medicine.

Myers was given medical details of cases and explained his reasoning as he pondered diagnoses. Computer scientists converted his logic chains into code. The resulting program, called INTERNIST-1, included more than 500 diseases and about 3 500 symptoms of disease.

To test it, researchers gave it cases from the New England Journal of Medicine. “The computer did really well,” Rodman said. Its performance “was probably better than a human could do”, he added.

But INTERNIST-1 never took off. It was difficult to use, requiring more than an hour to give it the information needed to make a diagnosis. And, its creators noted, “the present form of the program is not sufficiently reliable for clinical applications”.

Research continued. By the mid-1990s there were about a half dozen computer programs that tried to make medical diagnoses. None came into widespread use.

“It’s not just that it has to be user friendly, but doctors had to trust it,” Rodman said.

And with the uncertainty about how doctors think, experts began to ask whether they should care. How important is it to try to design computer programs to make diagnoses the same way humans do?

“There were arguments over how much a computer program should mimic human reasoning,” Lea said. “Why don’t we play to the strength of the computer?”

The computer may not be able to give a clear explanation of its decision pathway, but does that matter if it gets the diagnosis right?

The conversation changed with the advent of large language models like ChatGPT. They make no explicit attempt to replicate a doctor’s thinking; their diagnostic power comes instead from their ability to predict language.

“The chat interface is the killer app,” said Dr Jonathan Chen, a physician and computer scientist at Stanford who was an author of the new study.

“We can pop a whole case into the computer,” he said. “Before a couple of years ago, computers did not understand language.”

But many doctors may not be exploiting its potential.

Operator error

After his initial shock at the results of the new study, Rodman decided to probe a little deeper into the data and look at the actual logs of messages between the doctors and ChatGPT.

The doctors must have seen the chatbot’s diagnoses and reasoning, so why didn’t those using the chatbot do better?

It turns out that the doctors often were not persuaded by the chatbot when it pointed out something that was at odds with their diagnoses. Instead, they tended to be wedded to their own idea of the correct diagnosis.

“They didn’t listen to AI when it told them things they didn’t agree with,” Rodman said.

That makes sense, said Laura Zwaan, who studies clinical reasoning and diagnostic error at Erasmus Medical Centre in Rotterdam and was not involved in the study.

“People generally are overconfident when they think they are right,” she said.

But there was another issue: many of the doctors did not know how to use a chatbot to its fullest extent.

Chen said he noticed that when he peered into the doctors’ chat logs, “they were treating it like a search engine for directed questions: ‘Is cirrhosis a risk factor for cancer? What are possible diagnoses for eye pain?’

“It was only a fraction of the doctors who realised they could literally copy and paste the entire case history into the chatbot and just ask it to give a comprehensive answer to the entire question,” he added.

“Only a fraction of doctors actually saw the surprisingly smart and comprehensive answers the chatbot was capable of producing.”
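
To illustrate what Chen is describing, here is a minimal sketch of the “paste the whole case” approach, assuming the openai Python client; the model string, file name and prompt wording are illustrative assumptions rather than the study’s actual protocol.

```python
# Illustrative sketch only: pasting an entire case history into a chat model and
# asking for a full diagnostic work-up, rather than firing off directed questions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical file holding the full vignette: history, physical exam and lab results.
with open("case_vignette.txt") as f:
    case_history = f.read()

prompt = (
    "Here is a complete clinical case history:\n\n"
    f"{case_history}\n\n"
    "Suggest up to three possible diagnoses, with the findings that support and "
    "argue against each, give a final diagnosis, and name up to three next "
    "diagnostic steps."
)

response = client.chat.completions.create(
    model="gpt-4",  # the study used a GPT-4 model; the exact API model string here is an assumption
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```

The point is the shape of the interaction: a single message carrying the whole case, rather than a string of search-engine-style queries.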

Study details

Large language model influence on diagnostic reasoning: a randomised clinical trial

Ethan Goh, Robert Gallo, Jason Hom, et al.

Published in JAMA Network Open on 28 October 2024

Abstract

Importance
Large language models (LLMs) have shown promise in their performance on both multiple-choice and open-ended medical reasoning examinations, but it remains unknown whether the use of such tools improves physician diagnostic reasoning.

Objective
To assess the effect of an LLM on physicians’ diagnostic reasoning compared with conventional resources.

Design, Setting, and Participants
A single-blind randomised clinical trial was conducted from November 29 to December 29, 2023. Using remote video conferencing and in-person participation across multiple academic medical institutions, physicians with training in family medicine, internal medicine, or emergency medicine were recruited.

Intervention
Participants were randomised to either access the LLM in addition to conventional diagnostic resources or conventional resources only, stratified by career stage. Participants were allocated 60 minutes to review up to 6 clinical vignettes.

Main Outcomes and Measures
The primary outcome was performance on a standardised rubric of diagnostic performance based on differential diagnosis accuracy, appropriateness of supporting and opposing factors, and next diagnostic evaluation steps, validated and graded via blinded expert consensus. Secondary outcomes included time spent per case (in seconds) and final diagnosis accuracy. All analyses followed the intention-to-treat principle. A secondary exploratory analysis evaluated the standalone performance of the LLM by comparing the primary outcomes between the LLM alone group and the conventional resource group.

Results
Fifty physicians (26 attendings, 24 residents; median years in practice, 3 [IQR, 2-8]) participated virtually as well as at 1 in-person site. The median diagnostic reasoning score per case was 76% (IQR, 66%-87%) for the LLM group and 74% (IQR, 63%-84%) for the conventional resources-only group, with an adjusted difference of 2 percentage points (95% CI, −4 to 8 percentage points; P = .60). The median time spent per case for the LLM group was 519 (IQR, 371-668) seconds, compared with 565 (IQR, 456-788) seconds for the conventional resources group, with a time difference of −82 (95% CI, −195 to 31; P = .20) seconds. The LLM alone scored 16 percentage points (95% CI, 2-30 percentage points; P = .03) higher than the conventional resources group.

Conclusions and Relevance
In this trial, the availability of an LLM to physicians as a diagnostic aid did not significantly improve clinical reasoning compared with conventional resources. The LLM alone demonstrated higher performance than both physician groups, indicating the need for technology and workforce development to realise the potential of physician-artificial intelligence collaboration in clinical practice.

JAMA Network Open article – Large language model influence on diagnostic reasoning: a randomised clinical trial (Open access)

The New York Times article – AI chatbots outdid doctors at diagnosing illness (Restricted access)

See more from MedicalBrief archives:

Growing role for AI in everyday medical interactions

ChatGPT diagnoses child’s illness after 17 doctors fail

Swift detection of cancer in mammograms by AI – Swedish study

WHO issues AI regulatory list
