back to top
Thursday, 19 February, 2026
HomeArtificial Intelligence (AI)AI medical diagnoses may include fake health info – US study

AI medical diagnoses may include fake health info – US study

An alarming study has found that large language models like ChatGPT, while increasingly being used in healthcare, will accept fake medical claims if they are presented as realistic in medical notes and social media discussions, according to the researchers.

And AI is also unable to distinguish nuances when asked to diagnosis a medical symptom, warn experts, so you may not be getting the correct feedback, which poses all sorts of potential hazards.

Understanding the limitations of AI diagnoses

We’ve all been there – found a mysterious ache or unexplained rash that invariably would lead us straight to “Doctor Google”: you type in a mild cough and, three clicks later, you’re convinced you have a rare, tropical disease.

In the era of digital hypochondria, reports IOL, search results often led to more panic than peace of mind, but today, consultations have shifted from search bars to sophisticated AI tools like ChatGPT.

With their calm, authoritative tone and ability to process vast amounts of data in seconds, it’s tempting to treat these bots as a pocket-sized medical specialist.

However, the tech giants themselves are urging caution, with Google recently removing several AI-generated health summaries from its search results after investigations revealed inaccuracies in its responses.

Even OpenAI, the creator of ChatGPT, includes a firm disclaimer at the bottom of medical-related interactions: “ChatGPT can make mistakes. Check important info. This tool is not intended for medical diagnosis or treatment.”

The authors of the recent study, published in The Lancet Digital Health, said that some of these leading AI systems can mistakenly repeat false health information if it’s presented in realistic medical language, reports Euronews.

The study analysed more than 1m prompts, with researchers wanting to answer one question: when a false medical statement is phrased credibly, will a model repeat it or reject it?

The authors said that while AI has the potential to be a real help for clinicians and patients, offering faster insights and support, the models need built-in safeguards that check medical claims before they are presented as fact.

“Our study shows where these systems can still pass on false information, and points to ways we can strengthen them before they are embedded in care,” they said.

The researchers at Mount Sinai Health System in New York tested 20 LLMs spanning major model families – including OpenAI’s ChatGPT, Meta’s Llama, Google’s Gemma, Alibaba’s Qwen, Microsoft’s Phi, and Mistral AI’s model – as well as multiple medical fine-tuned derivatives of these base architectures.

AI models were prompted with fake statements, including false information inserted into real hospital notes, health myths from Reddit posts, and simulated healthcare scenarios.

Across all of the models tested, LLMs fell for made-up information about 32% of the time, but results varied widely. The smallest or less advanced models believed false claims more than 60% of the time, while stronger systems, such as ChatGPT-4o, did so only 10% of the cases.

The study also found that medical fine-tuned models consistently underperformed compared with general ones.

“Current AI systems can treat confident medical language as true by default, even when it’s clearly wrong,” said co-senior and co-corresponding author Eyal Klang from the Icahn School of Medicine at Mount Sinai.

“For these models, what matters is less whether a claim is correct than how it is written.”

Fake claims and harmful consequences

The researchers warn that some prompts from Reddit comments, for instance, accepted by LLMs, have the potential to harm patients.

At least three different models accepted misinformed facts such as “Tylenol can cause autism if taken by pregnant women”, “rectal garlic boosts the immune system”, “mammography causes breast cancer by ‘squashing’ tissue’”, and “tomatoes thin the blood as effectively as prescription anticoagulants”.

In another example, a discharge note falsely advised patients with oesophagitis-related bleeding to “drink cold milk to soothe the symptoms”. Several models accepted the statement rather than flagging it as unsafe, and treated it like ordinary medical guidance.

The models reject fallacies

The researchers also tested how models responded to information given in the form of a fallacy – convincing arguments that are logically flawed – such as “everyone believes this, so it must be true” (an appeal to popularity).

They found that, in general, this phrasing made models reject or question the information more easily.

However, two specific fallacies made AI models slightly more gullible: appealing to authority and slippery slope. Models accepted 34.6% of fake claims that included the words “an expert says this is true”.

When prompted “if X happens, disaster follows”, AI models accepted 33.9% of fake statements.

Next steps

The authors said the next step is to treat “can this system pass on a lie?” as a measurable property, using large-scale stress tests and external evidence checks before AI is built into clinical tools.

“Hospitals and developers can use our dataset as a stress test for medical AI,” said Mahmud Omar, the first author of the study.

“Instead of assuming a model is safe, you can measure how often it passes on a lie, and whether that number falls in the next generation,” he added.

Human ‘puzzle-solving’

Johannesburg’s Dr Jess Morris, a GP at Mediclinic Morningside, agreed that AI is an incredible starting point for gathering general information, but emphasised that it lacks the human “puzzle-solving” ability required for a real diagnosis.

The primary risk of relying on AI isn’t just getting the wrong answer; it’s getting a “false okay”, she added.

AI cannot interpret factors like blood pressure trends, cholesterol profiles, or prediabetes risks in the personalised way a healthcare specialist can.

“In some cases, this can cause unnecessary stress, but in others it can create false reassurance, delaying professional care when early intervention could make a critical difference,” said Morris.

She cited three common health questions that “Dr AI” simply cannot answer with the necessary nuance:

Do I have high blood pressure?

The “silent killer”remains one of the hardest conditions to self-diagnose via a chatbot. Hypertension often lacks any obvious symptoms until it reaches a dangerous level.

“By the time symptoms like chest pain, blurred vision or anxiety appear, a person may already be experiencing a hypertensive crisis,” Morris said.

“A single reading, or an online reference range, cannot show how blood pressure behaves over time or during sleep, nor can it reflect a person’s risk of stroke or heart disease.”

The fix: Morris recommends professional 24-hour blood pressure monitoring at least twice a year to see the full picture.

Should I worry about my cholesterol level?

An AI can tell you what a “normal” range looks like on paper, but it doesn’t know you. It cannot assess how your levels interact with your specific family history or lifestyle.

“Many people with abnormal lipid levels feel perfectly well until they experience a heart attack or stroke,” notes Morris. Doctors look at your results alongside your weight, age, and other conditions like diabetes to decide if you need a salad or a prescription.

The fix: If you have a family history of heart issues, start screening at 16. Otherwise, annual checks from age 30 are a must.

Am I prediabetic?

Prediabetes is a grey area where AI often trips up. Slightly high blood sugar doesn’t guarantee you’ll develop diabetes, but it’s not a ‘safe’ zone either.

“Prediabetes often has no symptoms and is typically identified through screening tests such as HbA1C, which reflects blood sugar levels over several months rather than at a single point in time,” said Morris.

Clinicians must weigh these results against weight changes and activity levels to create a plan that could actually reverse the condition.

Use AI as a map, not a destination

AI is a brilliant tool for understanding terms or learning how a specific medication works in theory. It adds value by helping you frame better questions for your doctor, but it cannot replace clinical judgment.

“Medical test results are pieces of a larger puzzle, not definitive answers in isolation,” Morris said.

“When it comes to understanding your health, there is no shortcut that replaces a conversation with a qualified healthcare professional who can consider the full context and guide you towards appropriate care.”

Study details

Mapping the susceptibility of large language models to medical misinformation across clinical notes and social media: a cross-sectional benchmarking analysis

Mahmud Omar, Vera Sorin, Carol Horowitz, et al.

Published in The Lancet Digital Health in January 2026

Summary

Background
Large language models (LLMs) are increasingly used in healthcare but remain vulnerable to medical misinformation. We aimed to evaluate how often these models accept or reject fabricated medical content, and how framing that content as a logical fallacy changes results.

Methods
In this cross-sectional benchmarking analysis, we probed 20 LLMs with more than 3.4m prompts that all contained health misinformation drawn from three sources: public-forum and social-media dialogues, real hospital discharge notes in which we inserted a single false recommendation, and 300 physician-validated simulated vignettes. Logical fallacies – common patterns of flawed reasoning such as appeals to authority, popularity, or emotion – were used to test how rhetorical framing influences model behaviour. Each prompt was posed once in a neutral base form and ten times with a named logical fallacy. For every run we logged susceptibility (model accepts the false claim) and fallacy detection (model flags the rhetoric).

Findings
Across all models and corpora, LLMs were susceptible to fabricated data in 50 108 (31·7%) of 158 000 base prompts. Eight of 10 fallacy framings significantly reduced or did not change that rate, led by appeal to popularity (susceptibility 11.9%; difference of –19.8 percentage points; p<0·0001); only the slippery-slope prompt (33.9%; difference of 2.2 percentage points; p<0·0001) and the appeal-to-authority prompt (34.6%; difference of 2.9 percentage points; p<0·0001) increased it. Real hospital notes (with fabricated inserted elements) produced the highest susceptibility to the base prompt (46 108 [46.1%] of 100 000), whereas social-media misinformation showed lower base prompt susceptibility (2479 [8.9%] of 28 000). Performance varied by model: GPT models were the least susceptible and most accurate at fallacy detection, whereas others, such as Gemma-3–4B-it, showed 63.6% (5023 of 7 900) susceptibility.

Interpretation
These results show that LLMs still absorb harmful medical fabrications, especially when phrased in authoritative clinical prose, yet, counter-intuitively, become less vulnerable when the same claims are wrapped in most logical fallacy styles. Therefore, improving safety appears to depend less on model scale and more on fact-grounding and context-aware guardrails.

 

The Lancet Digital Health article – Mapping the susceptibility of large language models to medical misinformation across clinical notes and social media: a cross-sectional benchmarking analysis (Open access)

 

Euronews article – ChatGPT and other AI models believe medical misinformation on social media, study warns (Open access)

 

IOL article – Using AI for health: A guide to understanding its limitations (Open access)

 

See more from MedicalBrief archives:

 

Botched procedures rise as AI medical use climbs

 

AI algorithms in diagnosis could harm patients – Dutch study

 

The risks of ChatGPT in healthcare

MedicalBrief — our free weekly e-newsletter

We'd appreciate as much information as possible, however only an email address is required.