A new study from Harvard Medical School and Beth Israel Deaconess Medical Center reveals that large language models (LLMs) outperformed physicians in critical emergency care decisions. The AI, specifically OpenAI's o1-preview model, excelled in triage, diagnosis, and management reasoning, often with less information than human doctors.
"We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines," said co-senior author Dr. Arjun Manrai. However, he cautioned that rigorous prospective trials are still needed before deployment in real-world clinical settings.
Researchers noted the most significant performance gap occurred during triage, when patient data is limited. As more information became available, both AI and human doctors improved their accuracy.
The study's authors call for healthcare systems to invest in computing infrastructure and develop frameworks for safe AI integration, while emphasizing that humans remain the ultimate baseline for evaluating performance and safety.