Wednesday, 4 March, 2026

ChatGPT Health fails to ID emergencies – US study

Although ChatGPT Health performed well in textbook emergencies like stroke or severe allergic reactions, it regularly missed the need for urgent medical care and frequently failed to detect suicidal ideation, a study of the AI platform has found, failures that US experts worry could “feasibly lead to unnecessary harm and death”, reports The Guardian.

OpenAI launched the “Health” feature of ChatGPT to limited audiences in January, touting it as a way for users to “securely connect medical records and wellness apps” to generate health advice and responses. More than 40m people reportedly ask ChatGPT for health-related advice every day.

The first independent safety evaluation of ChatGPT Health, published in Nature, found it under-triaged more than half of the emergency cases presented to it.

The lead author of the study, Dr Ashwin Ramaswamy, a urology instructor at the Icahn School of Medicine at Mount Sinai in the United States, said: “We wanted to answer the most basic safety question: if someone is having a real medical emergency and asks ChatGPT Health what to do, will it tell them to go to the emergency department?”

Ramaswamy and his colleagues created 60 realistic patient scenarios covering health conditions from mild illnesses to emergencies. Three independent doctors reviewed each scenario and agreed on the level of care needed, based on clinical guidelines.

The team then asked ChatGPT Health for advice on each case under 16 different conditions, including changing the patient’s gender, adding test results, or adding comments from family members, generating 960 responses in total.

Thereafter, they compared the platform’s recommendations with the doctors’ assessments.

They said while it did well in textbook emergencies like stroke or severe allergic reactions, it battled in other situations. In one asthma scenario, it advised waiting rather than seeking emergency treatment, despite the platform identifying early warning signs of respiratory failure.

In 51.6% of cases where someone needed to go to the hospital immediately, the platform advised staying home or booking a routine medical appointment, a result Alex Ruani, a doctoral researcher in health misinformation mitigation at University College London, described as “unbelievably dangerous”.

“If you’re experiencing respiratory failure or diabetic ketoacidosis, you have a 50/50 chance of this AI telling you it’s not a big deal,” she said. “What worries me most is the false sense of security these systems create. If someone is told to wait 48 hours during an asthma attack or diabetic crisis, that reassurance could cost them their life.”

In one of the simulations, the platform sent a suffocating woman to a future appointment she would not live to see more than eight times out of 10 (84% of the time), said Ruani, who was not involved in the study. Meanwhile, 64.8% of completely safe individuals were told to seek immediate medical care.

The platform was also nearly 12 times more likely to downplay symptoms when the “patient” told it a “friend” had suggested it was nothing serious.

“It is why many of us studying these systems are focused on urgently developing clear safety standards and independent auditing mechanisms to reduce preventable harm,” Ruani said.

A spokesperson for OpenAI said while the company welcomed independent research evaluating AI systems in healthcare, the study did not reflect how people typically use ChatGPT Health in real life. The model is also continuously updated and refined, the spokesperson said.

Ruani said that even though the cases were simulations created by the researchers, “a plausible risk of harm is enough to justify stronger safeguards and independent oversight”.

Ramaswamy said he was particularly concerned by the platform’s under-reaction to suicidal ideation.

“We tested ChatGPT Health with a 27-year-old patient who said he’d been thinking about taking a lot of pills,” he said. When the patient described his symptoms alone, the crisis intervention banner linking to suicide help services appeared every time.

“Then we added normal lab results,” Ramaswamy said. “Same patient, same words, same severity. The banner vanished. Zero out of 16 attempts. A crisis guardrail that depends on whether you mentioned your labs is not ready, and it’s arguably more dangerous than having no guardrail at all, because no one can predict when it will fail.”

Professor Paul Henman, a digital sociologist and policy expert with the University of Queensland, said: “This is a really important paper.

“If ChatGPT Health was used by people at home, it could lead to higher numbers of unnecessary medical presentations for low-level conditions and a failure of people to obtain urgent medical care when required, which could feasibly lead to unnecessary harm and death.”

He said it also raised the prospect of legal liability, with legal cases against tech companies already in motion in relation to suicide and self-harm after using AI chatbots.

“It is not clear what OpenAI is seeking to achieve by creating this product, how it was trained, what guardrails it has introduced and what warnings it provides to users,” Henman said.

“Because we don’t know how ChatGPT Health was trained and what context it was using, we don’t really know what is embedded into its models.”

Study details

ChatGPT Health performance in a structured test of triage recommendations

Ashwin Ramaswamy, Alvira Tyagi, Hannah Hugo et al.

Published in Nature on 23 February 2026

Abstract

ChatGPT Health launched in January 2026 as OpenAI’s consumer health tool, reaching millions of users. Here, we conducted a structured stress test of triage recommendations using 60 clinician-authored vignettes across 21 clinical domains under 16 factorial conditions (960 total responses). Performance followed an inverted U-shaped pattern, with the most dangerous failures concentrated at clinical extremes: non-urgent presentations (35%) and emergency conditions (48%). Among gold-standard emergencies, the system under-triaged 52% of cases, directing patients with diabetic ketoacidosis and impending respiratory failure to 24–48-hour evaluation rather than the emergency department, while correctly triaging classical emergencies such as stroke and anaphylaxis. When family or friends minimized symptoms (anchoring bias), triage recommendations shifted significantly in edge cases (OR 11.7, 95% CI 3.7-36.6), with the majority of shifts toward less urgent care. Crisis intervention messages activated unpredictably across suicidal ideation presentations, firing more when patients described no specific method than when they did. Patient race, gender, and barriers to care showed no significant effects, though confidence intervals did not exclude clinically meaningful differences. Our findings reveal missed high-risk emergencies and inconsistent activation of crisis safeguards, raising safety concerns that warrant prospective validation before consumer-scale deployment of artificial intelligence triage systems.

 

Nature article – ChatGPT Health performance in a structured test of triage recommendations (Open access)

 

The Guardian article – ‘Unbelievably dangerous’: experts sound alarm after ChatGPT Health fails to recognise medical emergencies (Open access)

 

See more from MedicalBrief archives:

 

AI medical diagnoses may include fake health info – US study

 

Botched procedures rise as AI medical use climbs

 

ChatGPT diagnoses child’s illness after 17 doctors fail

 

The risks of ChatGPT in healthcare

 

ChatGPT adept at answering common public health questions – US study
