Generative artificial intelligence (AI) still lacks the reasoning capabilities for safe clinical use, a new study from Mass General Brigham has found.

Despite improvements, AI chatbots failed to produce an appropriate differential diagnosis more than 80% of the time. Researchers evaluated 21 large language models (LLMs) using standardized clinical vignettes and a new tool, PrIME-LLM, to assess reasoning across diagnosis, testing, and treatment planning.

The study found that LLMs struggled with the initial, open-ended stage of a case, particularly in generating differential diagnoses and navigating uncertainty. While the models achieved high accuracy on final diagnoses once complete data was provided, they fell short in the critical first step of working out which conditions could explain a set of overlapping symptoms.

Models such as Grok 4, GPT-5, and Gemini 3.0 Pro performed better than the others but still require significant human oversight. The researchers emphasize that off-the-shelf LLMs are not ready for unsupervised clinical deployment, reinforcing the indispensable role of human clinical judgment in healthcare.