AI Models Fail Critical 'Bullshit' Detection Test

A new benchmark test, BullshitBench, designed to measure AI's ability to detect nonsensical queries, has revealed a critical flaw in most major models. These AIs frequently and confidently answer unanswerable prompts, failing to identify the inherent lack of logic.

The benchmark includes 100 questions across domains like software, finance, legal, medical, and physics. Each question is crafted to sound legitimate but contains a fundamental premise that makes it impossible to answer. The expected correct response is to identify the question as nonsensical.

However, most models proceed to offer elaborate, fabricated analyses. For instance, when asked how switching screw types in a bathroom cabinet might affect food flavor in a kitchen pantry, many AIs provided detailed, yet baseless, explanations. Similarly, a physics query about font choice on a pendulum's scale affecting its period was treated as a genuine scientific problem by some models.

Anthropic's Claude models demonstrated superior performance, with Claude Sonnet 4.6 achieving 91% accuracy in identifying nonsense. Conversely, Google's Gemini models and OpenAI's GPT models showed significantly lower scores, with some models pushing back on less than 20% of the nonsensical prompts. This failure to recognize absurdity points to an 'insidious flavor' of hallucination, where AI confidently generates misinformation, potentially leading to serious real-world consequences.

Experts note that standard training procedures often reward AI for guessing over acknowledging uncertainty, contributing to this problem. The BullshitBench results underscore the need for AI systems that can not only generate information but also critically evaluate the validity of the questions they are asked.