Study Finds AI Outperforms Doctors in Emergency Diagnoses

Medicine was one of the professions most often predicted to be unaffected by the AI takeover. But a new study from Harvard suggests otherwise.
In a study titled “Performance of a large language model on a doctor’s reasoning tasks,” researchers evaluated the medical reasoning capabilities of OpenAI o1, an updated large language model, across six experiments and compared it against hundreds of expert doctors. They found that, at least in high-pressure emergency medicine triage, the model outperformed human doctors, making more accurate diagnoses in the potentially life-or-death moments when patients were first admitted to the hospital.
In a trial in a Boston hospital emergency room, o1 outperformed both GPT-4o and two attending physicians at three diagnostic touchpoints: emergency department triage, evaluation by the emergency room physician, and admission to the medical floor or intensive care unit. Its advantage was strongest at the earliest stage, when information was scarce and quick decisions were required.
In another experiment, the researchers tested the AI on a well-known series of challenging medical case studies that has been used since the 1950s to assess how doctors reason through diagnoses.
The results showed that the AI included the correct disease in its list of possible diagnoses in approximately 78% of cases, and in more than half (52%) its first-choice prediction was correct. When responses that were “potentially useful or very close diagnoses” were also counted, accuracy rose to 97.9%.
Beyond identifying diseases, the AI was also tested on recommending follow-up patient care, including choosing the right medical test. It chose the correct test in 87.5% of cases, and in another 11% its recommendation was good enough to be considered useful.
On simulated patient cases, OpenAI o1 received a perfect score in 78 out of 80: “It performed significantly better than GPT-4 (47/80), attending physicians (28/80), and junior physicians (16/72).”
A helper, not a substitute for human doctors
Although at first glance this seems groundbreaking, there is a catch. The tests were entirely text-based, which means that, for now, even the most advanced model can be trusted only as a second opinion. The study notes that real-life clinical medicine is “multifaceted and full of extra-textual input,” and emphasizes that auditory input, such as a patient’s level of distress, and visual input, such as the interpretation of medical imaging studies, were not tested on the model.
“The integration of AI into emergency care must be approached carefully. It is a tool that can improve clinical practice when used appropriately. It cannot replace the doctor at the bedside,” says Dr Nayan Sriramula, Head of Emergency Medicine and Trauma, Medicover Hospitals.
“The challenge lies in using technology to strengthen systems without undermining the role of clinical judgment. In Indian emergency departments, where unpredictability is constant, the ability to make timely and context-sensitive decisions is vital,” Dr Nayan added.
The study suggested that although applying AI to clinical decision support is considered a high-risk endeavor, wider use of these tools could help reduce the human and financial costs of diagnostic error, delay, and lack of access to care.