OpenAI has released GeneBench, a rigorous benchmark designed to test artificial intelligence on the most challenging problems in computational biology. The benchmark, detailed in a bioRxiv preprint, assesses AI models on 103 complex tasks spanning ten domains in genomics and quantitative biology.

Each problem requires the kind of multi-stage analysis that typically demands 10 to 40 hours from a senior human scientist. Current AI models perform poorly. GPT-5.5 Pro achieved the highest pass rate at just 33.2%, and on more than 60% of problems every tested model had a pass rate below 20%.

The benchmark provides a concrete measure of AI's current capabilities in biomedical research. It highlights a significant gap between AI's promise for drug discovery and genomic analysis and its present, limited performance on real-world scientific tasks.

The results also include a notable efficiency finding. A specialized model, GPT-Rosalind, achieved a 21.6% pass rate, comparable to that of GPT-5.5, while using 31% fewer tokens. This suggests a path toward more economical deployment of AI in research workflows.