AI Models Disagree on Two-Thirds of Fact-Checks, Lenz Research Study Finds

Ask five leading AI models whether a statement is true, and two-thirds of the time, at least one will disagree with the rest. That is the core finding from a new Lenz Research study examining GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro + Search, and Sonar Pro on 1,000 real-world fact-check claims submitted by users.

The results: in 672 of 1,000 claims, or 67%, at least one model dissented from the panel majority. In 343 claims, roughly 34%, the researchers found what they call substantive disagreements, where the two most opposed models landed at least two verdict categories apart on a scale from True to False.
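The two measures above can be sketched in a few lines of Python. This is an illustration, not the study's code. The five-point scale is a hypothetical one, since the article names only some of the categories, and the sketch assumes that any non-unanimous panel counts as having a dissenter.

```python
# Hypothetical five-point verdict scale, ordered from True to False.
# The study's exact category list is not given in the article.
SCALE = ["True", "Mostly True", "Misleading", "Mostly False", "False"]
RANK = {label: position for position, label in enumerate(SCALE)}


def has_dissent(verdicts):
    """Return True when the panel is not unanimous (assumed reading of 'dissent')."""
    return len(set(verdicts)) > 1


def is_substantive(verdicts, min_gap=2):
    """Return True when the two most opposed verdicts are at least min_gap categories apart."""
    ranks = [RANK[v] for v in verdicts]
    return max(ranks) - min(ranks) >= min_gap


def disagreement_rates(panels):
    """Return (share of claims with dissent, share with substantive disagreement)."""
    total = len(panels)
    dissent = sum(has_dissent(p) for p in panels)
    substantive = sum(is_substantive(p) for p in panels)
    return dissent / total, substantive / total
```

Each panel is the list of verdicts the models returned for one claim. Applied to the study's figures, the two shares would be 672/1,000 and 343/1,000.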

The study measured agreement with Krippendorff’s alpha, a standard measure of inter-rater reliability, treating the verdicts as an ordinal scale. The five models scored 0.639. A score of 1.0 means perfect agreement; values below 0.667 are conventionally considered too low to support even tentative conclusions.
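For readers who want to see how such a score is computed, here is a minimal sketch of the standard ordinal form of Krippendorff's alpha, alpha = 1 - D_o / D_e, where D_o is the observed disagreement and D_e the disagreement expected by chance. It is not the study's code. The function name and input layout are assumptions: each claim is a list of integer ranks, for example 0 for True through 4 for False.

```python
from itertools import permutations


def krippendorff_alpha_ordinal(units, n_categories):
    """Ordinal Krippendorff's alpha for a list of claims, each a list of integer ranks."""
    k = n_categories
    # Coincidence matrix: every ordered pair of ratings within one claim
    # contributes 1 / (m - 1), where m is the number of ratings for that claim.
    coincidence = [[0.0] * k for _ in range(k)]
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a claim with a single rating carries no agreement information
        for a, b in permutations(ratings, 2):
            coincidence[a][b] += 1.0 / (m - 1)

    totals = [sum(row) for row in coincidence]  # n_c, the marginal count per category
    n = sum(totals)

    def delta_sq(c, d):
        # Ordinal distance: squared count of ratings lying between the two ranks.
        lo, hi = min(c, d), max(c, d)
        return (sum(totals[lo:hi + 1]) - (totals[lo] + totals[hi]) / 2) ** 2

    observed = sum(coincidence[c][d] * delta_sq(c, d)
                   for c in range(k) for d in range(k))
    expected = sum(totals[c] * totals[d] * delta_sq(c, d)
                   for c in range(k) for d in range(k))
    if expected == 0:
        raise ValueError("alpha is undefined when every rating is identical")
    return 1 - (n - 1) * observed / expected
```

Perfect agreement across varied claims yields 1.0, agreement no better than chance yields about 0, and systematic opposition can push the value below zero.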

The dataset reflected real-world conditions, with all claims submitted after February 15, 2026. The disagreement was systematic rather than random: some models leaned toward binary True/False verdicts, while others spread their answers across middle categories such as Mostly True and Misleading.

For crypto investors, the takeaway is clear: relying on a single AI model for trading signals or market analysis is risky. When multiple models agree, confidence increases; when they split, human judgment is required.