A recent study published in Science reports that a large language model (LLM) outperformed hundreds of physicians across a range of clinical reasoning and diagnostic tasks.

In evaluations using established clinical case conferences, the LLM correctly identified the final diagnosis in up to 78% of cases, and its top suggested diagnosis was accurate more than half the time. Performance improved further when broader diagnostic criteria were applied.

In real-world emergency department settings, the AI system achieved correct or near-correct diagnoses in up to 81.6% of cases at hospital admission, surpassing attending physicians. The performance gap was largest during early triage, when information is limited and rapid decisions are critical.

Beyond diagnosis, the model demonstrated strong capabilities in estimating clinical probabilities, selecting appropriate investigations, and generating structured differential diagnoses. In several benchmarks, it exceeded both previous AI models and physician groups using standard clinical resources.

Researchers caution that current evaluations rely on structured, text-based scenarios that may not capture the full complexity of real-world care, which includes imaging interpretation and bedside assessment. They emphasize the need for prospective clinical trials to assess safety and effectiveness before wider adoption.

The findings suggest AI could support diagnostic decision-making, especially in settings with limited specialist access, but human-AI collaboration and rigorous real-world validation remain essential.