Key Takeaways
- When asked to verify 1,000 real-world claims, five cutting-edge AI models reached conflicting conclusions 67% of the time.
- Complete consensus among all models was achieved on just 328 out of the 1,000 claims.
- With a Krippendorff’s alpha score of 0.639, the models failed to meet the 0.8 threshold typically required for reliable inter-rater consistency.
Pose the same fact-checking question to five of the world’s most advanced AI systems, and roughly two-thirds of the time, you’ll receive contradictory responses. That’s the central finding from a May study conducted by Kosta Jordanov of Lenz Research.
The participating models—GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro with Search, and Sonar Pro—were each presented with the same 1,000 real-world fact-check claims sourced from genuine user submissions. For each claim, the models had to assign one of four categories: true, mostly true, misleading, or false.
Across the full set of 1,000 claims, at least one model diverged from the majority on 672 occasions. In a striking 34% of cases, the disagreement was particularly sharp—one model rated a claim as true while another classified it as false.
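The tallies above are simple counts over per-claim verdicts. As a rough sketch of how they could be computed (the function name `panel_stats` and the toy verdicts are hypothetical and not taken from the study), each claim is a tuple of five labels, one per model:

```python
from collections import Counter

def panel_stats(verdicts):
    """Tally agreement patterns over per-claim verdict tuples (one label per model)."""
    stats = Counter()
    for row in verdicts:
        distinct = set(row)
        if len(distinct) == 1:
            stats["unanimous"] += 1      # every model gave the same label
        else:
            stats["split"] += 1          # at least one model diverged
        if "true" in distinct and "false" in distinct:
            stats["true_vs_false"] += 1  # sharpest split: outright true vs. outright false
    return stats

# Toy data for illustration only; these are NOT claims or verdicts from the study.
toy = [
    ("true", "true", "true", "true", "true"),
    ("false", "mostly true", "false", "true", "misleading"),
    ("false", "false", "false", "false", "false"),
    ("mostly true", "true", "true", "misleading", "true"),
]
stats = panel_stats(toy)
print(dict(stats))
```

Run over the study's 1,000 claims, the "unanimous" and "split" counters would correspond to the reported 328 and 672, assuming the verdicts were stored in this shape.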
‘These aren’t standard benchmark questions with publicly available answers—they’re real claims that everyday users submitted to a fact-checking service,’ the study explains. ‘Since only one label can be correct for each claim, whenever the panel disagrees, at least one model is assigning an inconsistent verdict under this four-category framework.’
Earlier research on AI hallucinations has highlighted how chatbots fabricate information entirely. That’s a separate issue. The challenge revealed here is fundamentally different. The models aren’t necessarily spewing made-up facts—they simply can’t reach a shared conclusion on basic factual assessments of identical material.
The study was deliberately designed to minimize the ability of AI companies to dismiss the findings. Rather than sourcing claims from well-known test sets, which frequently leak into training data, the researchers used genuine user-submitted queries from Lenz’s fact-checking platform. ‘It’s highly unlikely that most of these claims appear in any training dataset paired with a definitive ground-truth label; there’s no authoritative answer key to memorize, no leaderboard to overfit toward,’ the paper states.
The statistical metric used to quantify agreement, Krippendorff’s alpha, registered a score of 0.639, where 1.0 represents perfect consensus and 0 reflects agreement no better than chance. The study characterizes this as ‘meaningful but constrained alignment.’ ‘The model outputs exhibit a discernible pattern rather than pure noise, yet they lack the consistency needed to treat any single model as interchangeable with the others on the panel,’ the researchers observe. In research contexts, any score falling below 0.8 is typically viewed as insufficient for reliable agreement, which places these AI models short of that bar. By that standard, a score of 0.639 indicates only a weak level of agreement.
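For readers unfamiliar with the metric, Krippendorff's alpha compares the disagreement actually observed among raters with the disagreement expected by chance: alpha = 1 - D_o / D_e. The sketch below is a minimal version for this kind of panel, under stated assumptions: the four verdicts are treated as unordered (nominal) categories, every model rates every claim, and the function name `krippendorff_alpha_nominal` is hypothetical. The article does not say which variant of alpha the study used, so this should not be read as the researchers' own computation.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: list of label sequences, one per claim, each holding the
    labels assigned by the raters (models) to that claim.
    """
    # Coincidence counts: every ordered pair of labels from two different
    # raters on the same claim, weighted by 1 / (raters - 1).
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a claim with a single rating carries no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    # Marginal totals per label, and the overall number of pairable ratings.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    mismatched = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if mismatched == 0:
        return float("nan")  # alpha is undefined when only one label ever appears
    expected = mismatched / (n - 1)
    return 1 - observed / expected

# Toy data for illustration only; not drawn from the study.
perfect = [("true",) * 5, ("false",) * 5]
mixed = [("a", "a"), ("a", "b"), ("b", "b"), ("b", "a")]
alpha_perfect = krippendorff_alpha_nominal(perfect)
alpha_mixed = krippendorff_alpha_nominal(mixed)
print(alpha_perfect, alpha_mixed)
```

On the toy data, the fully agreeing panel scores 1.0, while the two-rater panel that agrees on only half its items lands near 0.125, far below the 0.8 convention. If the study treated the four labels as an ordered scale instead, the disagreement weights would differ and so would the result.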
In the rare instances where all five AI models reached the same conclusion—occurring for just 328 out of 1,000 claims—they almost never concurred that a claim was misleading or mostly true. Only four claims were unanimously labeled “misleading,” and not a single one received a consensus “mostly true” rating.
The researchers highlighted cases where the AI models diverged the most, such as the claim: “The World Bank’s active portfolio in Nigeria stands at over $16.4 billion as of 2025.” GPT-5.4 classified this as “mostly true,” Gemini 3 Pro called it “false,” and its counterpart Gemini 3 Pro + Search deemed it “misleading.”
In a separate example, when given the statement “Donald Trump said that an attack on Iran was postponed at the request of Gulf Allies,” GPT-5.4 labeled it false, Claude Opus 4.7 rated it mostly true, Gemini 3 Pro called it false, and Gemini 3 Pro + Search rated it true.
“The panel converges on definitive verdicts; the middle of the rubric is where it fractures,” the researchers noted. Consensus only occurred at the two extremes—where a claim was either clearly true or clearly false.
This is significant because more and more people are relying on AI for fact-checking. If you feed the same claim from a news article into ChatGPT, Claude, or Gemini, you could end up with three different assessments. So which one should you trust?
AI companies frequently boast that their models are becoming more accurate, publishing benchmark scores that show steady progress. But the Lenz study tested these models against the kind of messy, ambiguous claims that real people actually debate—and discovered that the models disagree just as much as humans do.
The paper is careful to emphasize one key point: “A majority of frontier models is not ground truth. The majority verdict is sometimes wrong; an individual dissenting model is sometimes right. We use the majority as a structural reference point for measuring disagreement, not as a stand-in for correctness.”
Underlying these numbers is a deeper issue. When models disagree, at least one of them has to be wrong—what the study terms “label-inconsistent under this 4-bucket rubric.” There’s no process for settling disputes, no higher authority to appeal to. Recent coverage of AI reliability has raised similar concerns.
Among the 328 claims where all five models agreed, zero received a unanimous “mostly true” verdict. The middle ground simply vanished. If AI models can only reach consensus at the extremes, can they truly be trusted as fact checkers?



