Not a real doctor: AI struggles to treat human patients

Just because AI chatbots can pass medical exams doesn't mean they should act as doctors: they could provide deadly medical advice, a study warns.
In one case, two men who asked an AI tool about a brain hemorrhage received opposite guidance; one was encouraged to seek immediate medical attention, while the other was told to rest in a quiet room.
Researchers from the University of Oxford published the warning in the journal Nature Medicine on Tuesday, following a large-scale randomized trial involving 1,300 participants.
The research comes a month after OpenAI announced its ChatGPT Health service, revealing that the popular AI tool answers more than 230 million health-related questions every day.
The study, conducted by researchers at the Oxford Internet Institute, part of the University of Oxford, tested whether three large language models could help people identify a medical condition and determine the right course of action to treat it.
Each of the 1,298 participants was given one of 10 medical scenarios involving conditions ranging from colds and pneumonia to gallstones and pulmonary embolism.
One group used traditional information sources, such as web searches, to assess their condition, while the other group used large language models: OpenAI's GPT-4o, which powers ChatGPT, Meta's Llama 3, and Cohere's Command R+.
Only 34.5 percent of participants using the AI tools correctly identified their medical condition, and just 44.2 percent chose the right course of action, whether that was calling an ambulance or caring for themselves at home.
In one case, two participants who described symptoms of a subarachnoid hemorrhage received conflicting guidance from GPT-4o, and only one was told to seek medical attention immediately. In another, a participant who described symptoms of gallstones was told they were likely indigestion or reflux.
University of Oxford Clarendon-Reuben PhD student Rebecca Payne said the findings highlighted some of the challenges of creating technology to solve human problems.
“Despite all the hype, AI is not yet ready to take on the role of doctor,” Dr Payne said.
“Patients need to be aware that asking a large language model about their symptoms can be dangerous, leading to incorrect diagnoses and a failure to recognize when immediate help is needed.”
The AI tools made more accurate medical assessments when tested on the scenarios directly, without human participants in the loop, correctly identifying the condition in 94.7 percent of cases and recommending the correct course of action in more than half (56.3 percent).
Oxford Internet Institute researcher and lead author Andrew Bean said the findings point to a communication gap between humans and AI, and to the need for better ways of testing these systems.
“Interaction with humans poses a challenge even for cutting-edge models,” he said.
“Designing robust tests for large language models is key to understanding how we can leverage this new technology.”

