AI helps people and doctors make medical decisions
The symptoms started with a twinge, a strange sensation in your chest, then an unexplainable fatigue hit. You sit on the couch, hesitating. Should you see a doctor? Wait it out? And if you need a medical opinion, should you see a cardiologist, an internist, a neurologist?
If you’re like most people, your first stop is the internet. But while there are myriad symptom checkers available online, they are rarely accurate. A 2022 study found that digital symptom checkers list the correct diagnosis first in only 19–38% of cases. When the top three suggestions were considered, accuracy rose to just 33–58%.
In a study published in “npj Digital Medicine,” Farieda Gaber in the lab of Dr. Altuna Akalin, Head of Bioinformatics and Omics Data Science at the Berlin Institute for Medical Systems Biology of the Max Delbrück Center (MDC-BIMSB), and collaborators evaluated whether large language models (LLMs) could do better. More specifically, they studied how well LLMs could guide people – and doctors – toward the right care.
“Studies show that up to 30% of visits to emergency rooms are not necessary,” says Akalin, corresponding author on the paper. “If LLMs can reduce this number, it would help unburden healthcare systems.”
For the study, the team benchmarked four workflows based on startup Anthropic’s Claude models against 2,000 real emergency department admission cases from the MIMIC-IV-ED database, a large, public collection of anonymized health records from the Beth Israel Deaconess Medical Center in Boston.
The models were asked to do three things: suggest the appropriate medical specialist, provide diagnoses, and assess the urgency of the case – known as triage. They ran these tests in two scenarios: one mimicking a patient at home, with only symptoms and demographics provided; and one mimicking a clinician’s setting, where vital signs like heart rate and blood pressure were also available.
Start with the specialist
Unlike in Germany where patients usually need a referral from their primary care doctor, people in many other countries can directly seek out medical specialists. But figuring out which kind of doctor to see can be difficult. Would a gastroenterologist be best to evaluate abdominal pain? Or would a nephrologist be more appropriate?
In this task, the LLMs showed great promise. When only symptoms were provided, the Claude 3.5 Sonnet model listed an appropriate specialty among its top three suggestions in about 87% of cases. The other models performed similarly. Accuracy improved slightly when vital signs were included, as they would be in a clinical setting.
Doctors reviewing the AI’s suggestions agreed: they rated 97% of the specialty recommendations as accurate or at least clinically acceptable.
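The "top three suggestions" figure above is a standard top-k accuracy metric. As a minimal illustration with made-up specialty labels (not the study's actual code or data), it can be computed like this:

```python
# Sketch of top-k accuracy: a case counts as a hit if the correct
# specialty appears among the model's top k ranked suggestions.
# All data below is hypothetical, for illustration only.

def top_k_accuracy(predictions, truths, k=3):
    """predictions: list of ranked suggestion lists; truths: correct labels."""
    hits = sum(truth in ranked[:k] for ranked, truth in zip(predictions, truths))
    return hits / len(truths)

preds = [["cardiology", "internal medicine", "neurology"],
         ["gastroenterology", "surgery", "urology"],
         ["dermatology", "rheumatology", "orthopedics"]]
truth = ["internal medicine", "nephrology", "dermatology"]
print(top_k_accuracy(preds, truth, k=3))  # 2 of 3 cases hit
```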
The diagnosis
The models also performed well in predicting diagnoses. The best workflow correctly identified a diagnosis in over 82% of cases. Accuracy improved further when vital signs were included, especially for the retrieval-augmented generation (RAG) model, which could consult a database of 30 million PubMed abstracts when making decisions.
To test how well the AI’s guesses matched human judgment, researchers used two types of review. In the first, where a prediction counted as correct if at least one of two independent doctors agreed with it, the AI aligned with human judgment more than 95% of the time. In the stricter version, where both doctors had to agree, accuracy remained above 70%.
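The two review schemes described above amount to a lenient (at least one reviewer agrees) and a strict (both reviewers agree) scoring rule. A minimal sketch with hypothetical review data, not the study's evaluation code:

```python
# Lenient vs. strict scoring of physician reviews (hypothetical data):
# lenient counts a prediction as correct if at least one of two
# independent reviewers accepts it; strict requires both to accept.

def agreement_rates(reviews):
    """reviews: list of (doc1_agrees, doc2_agrees) booleans, one per case."""
    lenient = sum(a or b for a, b in reviews) / len(reviews)
    strict = sum(a and b for a, b in reviews) / len(reviews)
    return lenient, strict

# Five hypothetical cases reviewed by two doctors each
cases = [(True, True), (True, False), (False, True), (True, True), (False, False)]
lenient, strict = agreement_rates(cases)
print(lenient)  # 0.8
print(strict)   # 0.4
```

The gap between the two numbers mirrors the paper's finding: agreement looks high under the lenient rule and drops, but stays substantial, under the strict one.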
Triage remains tricky
The models were less accurate in assessing triage. While none of the models confused a life-threatening condition with a mild one, they often misjudged mid-level cases. That matters: both over-triaging – treating stable patients as more urgent than they are – and under-triaging – delaying care for serious cases – can harm patients. Trauma systems aim for under-triage rates below 5% – a goal no model in the study met.
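Over- and under-triage rates can be illustrated by comparing predicted and true acuity on an ordinal scale, as in the sketch below (a hypothetical 1–5 scale where 1 is most urgent, not the study's actual evaluation code):

```python
# Sketch of over-/under-triage rates on an ordinal acuity scale
# (1 = most urgent, 5 = least urgent). All data is hypothetical.

def triage_rates(true_levels, predicted_levels):
    n = len(true_levels)
    # Predicted MORE urgent than truth (smaller number) -> over-triage
    over = sum(p < t for t, p in zip(true_levels, predicted_levels)) / n
    # Predicted LESS urgent than truth (larger number) -> under-triage
    under = sum(p > t for t, p in zip(true_levels, predicted_levels)) / n
    return over, under

true_acuity = [1, 2, 3, 3, 4]
pred_acuity = [1, 1, 4, 3, 4]
over, under = triage_rates(true_acuity, pred_acuity)
print(over)   # 0.2 (one stable case flagged as more urgent)
print(under)  # 0.2 (one urgent case rated less urgent)
```

Under-triage is the clinically riskier error, which is why trauma systems hold it to the stricter 5% threshold mentioned above.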
Again, models with access to vital signs performed better, suggesting that richer input data – medical test results, for example – might help close the gap.
AI cannot replace doctors – but could assist them
“We are not suggesting that AI tools should replace clinicians,” says Akalin. “But well-designed, rigorously tested LLMs could serve as helpful aides – especially for less experienced providers.” Akalin and his colleagues would, however, like to see specific types of LLMs deployed directly to patients, especially those that help people find specialist doctors. They could replace less accurate symptom checkers, helping people decide whether and where to seek care. They could also relieve overburdened healthcare systems by reducing unnecessary doctor and hospital visits, he adds.
Official use of such tools would need to meet strict regulatory standards under the EU’s AI Act. Nevertheless, the authors warn of “leaky deployment” – when publicly available AI tools start being used informally in clinical settings.
“That’s why open, rigorous benchmarking like this is so critical,” says Gaber. “Research like this helps us to understand both the promise and limits of AI-based medical decision-making.” Akalin and his team are planning to further test the value of LLMs to both patients and doctors in real-world settings, such as a doctor’s office, using the platform 2ndOpin.io, which was developed in his lab. “The next question is: if we build such a tool, is it really useful?” he says.
LLMs that improve patient care are a particular research focus for Akalin, who developed onconaut.ai – an online AI-based tool that helps clinicians and patients better navigate personalized cancer therapies. Among other functions, cancer patients can enter their biomarker status and find a list of clinical trials for which they are eligible. He and his team recently improved the tool by “teaching” it to recognize the different names that can refer to the same biomarker – including spelling mistakes. Now patients searching for clinical trials can be more certain they are getting a complete list. The improvement to Onconaut’s search function was recently published in a paper in the same journal.
Text: Gunjan Sinha
Further information
Navigating cancer treatment with the help of AI
Literature
Farieda Gaber, Maqsood Shaik, Fabio Allega, et al. (2025) “Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis,” npj Digital Medicine DOI:10.1038/s41746-025-01684-1
Contacts
Dr. Altuna Akalin
Head, Bioinformatics and Omics Data Science
MDC-BIMSB
Altuna.Akalin@mdc-berlin.de
Gunjan Sinha
Editor, Communications
Max Delbrück Center
+49 30 9406-2118
Gunjan.Sinha@mdc-berlin.de or presse@mdc-berlin.de
Max Delbrück Center
The Max Delbrück Center for Molecular Medicine in the Helmholtz Association aims to transform tomorrow’s medicine through our discoveries of today. At locations in Berlin-Buch, Berlin-Mitte, Heidelberg and Mannheim, our researchers harness interdisciplinary collaboration to decipher the complexities of disease at the systems level – from molecules and cells to organs and the entire organism. Through academic, clinical, and industry partnerships, as well as global networks, we strive to translate biological discoveries into applications that enable the early detection of deviations from health, personalize treatment, and ultimately prevent disease. First founded in 1992, the Max Delbrück Center today inspires and nurtures a diverse talent pool of 1,800 people from over 70 countries. We are 90 percent funded by the German federal government and 10 percent by the state of Berlin.