Can artificial intelligence truly perform mathematics, or is it merely matching patterns from existing data? In June 2026, thirty mathematicians at Harvard’s Center for Mathematical Sciences and Applications set out to answer that question by blind-grading AI-generated solutions to ten original, unpublished research-level problems.
The initiative, titled “First Proof, Second Batch,” featured an expert panel including Mohammed Abouzaid of Stanford, Nikhil Srivastava of UC Berkeley, Rachel Ward of UT Austin, and Lauren Williams of Harvard. By using problems that appear neither in textbooks nor in online archives such as arXiv, the team sought to ensure that the AI models could not rely on memorized training data.
Four leading AI systems, including models from OpenAI and Google, took part in the evaluation. The panel awarded passing grades on seven of the ten problems, a substantial improvement over preliminary trials, in which the same systems solved only two of ten.
This second assessment builds on earlier tests from February 2026 and establishes an ongoing framework for tracking genuine capability growth rather than performance on a static benchmark. Competition problems have known solutions, whereas research-level mathematics means working in unknown territory, where it is not certain that a solution exists at all. The findings suggest that frontier models are moving beyond pattern recognition toward authentic logical deduction on complex, novel problems.