From Pretraining to Post-Training: Why Language Models Hallucinate and How Evaluation Methods Reinforce the Problem
What Makes Hallucinations Statistically Inevitable?
Research indicates that hallucinations in large language models (LLMs) arise from statistical pressures inherent to generative modeling, not only from flawed training data. Even with perfectly clean data, the cross-entropy objective used in pretraining pushes models toward some errors. The researchers reduce the problem to a supervised binary classification task they call Is-It-Valid (IIV): deciding whether a candidate output is valid or erroneous. They show that a model's generative error rate is at least twice its IIV misclassification rate, so hallucinations stem from the same factors that cause misclassification in supervised learning: epistemic uncertainty about rarely seen facts, a model family too weak to represent the pattern, distribution shift, or noisy data.
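In simplified form (the notation below is mine, and the paper's full statement includes additional additive terms that are omitted here), the headline bound reads roughly:

```latex
\mathrm{err}_{\text{generative}} \;\gtrsim\; 2 \cdot \mathrm{err}_{\text{IIV}}
```

The intuition is that a model which rarely hallucinates could be turned into an accurate IIV classifier, so whenever that classification problem is statistically hard, some generative error is unavoidable.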
Why Do Rare Facts Trigger More Hallucinations?
A significant factor contributing to hallucinations is the singleton rate: the proportion of facts that appear exactly once in the training data. If 20% of facts are singletons, the model should be expected to hallucinate on at least roughly 20% of queries about such facts. This explains why LLMs are reliable on frequently repeated facts but fail on obscure or rarely mentioned ones.
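A rough way to measure this quantity is sketched below; the toy corpus, the `singleton_rate` helper, and the choice of distinct facts as the denominator are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter

def singleton_rate(facts):
    """Fraction of distinct facts that appear exactly once in the corpus."""
    counts = Counter(facts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)

# Toy corpus of extracted "entity|attribute" facts (invented for illustration).
facts = [
    "Ada Lovelace|born 1815-12-10", "Ada Lovelace|born 1815-12-10",
    "Alan Turing|born 1912-06-23",  "Alan Turing|born 1912-06-23",
    "Obscure Person A|born 1903-04-17",  # seen once: a likely hallucination target
]
print(singleton_rate(facts))  # 1 of 3 distinct facts is a singleton -> ~0.33
```

The same count-of-counts idea underlies Good-Turing estimates of how much probability mass a corpus leaves on rare or unseen facts, which is why singletons are a useful proxy for what a model cannot reliably memorize.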
Can Poor Model Families Lead to Hallucinations?
Yes, hallucinations can also stem from model families that cannot adequately represent the patterns in the data. For instance, n-gram models produce ungrammatical sentences because they condition on only a few preceding words, and subword-tokenized models can miscount letters because individual characters are hidden inside tokens. These representational limits cause systematic errors even when the underlying data is sufficient.
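As a toy illustration of the n-gram case (the two-sentence corpus and the model below are invented for this example, not drawn from the paper), a bigram model fit to perfectly grammatical text can still assign probability to a sentence that violates long-range subject-verb agreement:

```python
from collections import defaultdict

# Toy corpus: every training sentence is grammatical.
corpus = [
    "the girls who won smile".split(),
    "the girl who won smiles".split(),
]

# Fit a bigram model: record which words may follow each word.
follows = defaultdict(set)
for sentence in corpus:
    for prev, nxt in zip(sentence, sentence[1:]):
        follows[prev].add(nxt)

# With only one word of context, locally valid pairs chain into a globally
# ungrammatical sentence: "girls ... smiles" breaks subject-verb agreement.
candidate = "the girls who won smiles".split()
possible = all(nxt in follows[prev] for prev, nxt in zip(candidate, candidate[1:]))
print(possible)  # True: the bigram family gives this error nonzero probability
```

No amount of additional clean data fixes this; the error comes from the model family's limited context.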
Why Doesn’t Post-Training Eliminate Hallucinations?
Post-training techniques, including reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), and reinforcement learning from AI feedback (RLAIF), can reduce certain errors, particularly harmful or conspiratorial outputs. However, overconfident hallucinations persist because of how evaluations are scored. Current benchmarks typically use binary scoring: a correct answer earns a point, while abstaining and answering incorrectly both earn nothing, so guessing is never penalized relative to admitting uncertainty. This incentivizes LLMs to guess confidently rather than express uncertainty, producing more hallucinations.
How Do Leaderboards Reinforce Hallucinations?
Most benchmarks use binary grading with no partial credit for expressing uncertainty. As a result, a model that says "I don't know" scores lower than one that always guesses, which pushes developers to optimize for confident answers rather than calibrated responses.
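A quick expected-value calculation (a sketch, not code from the paper) makes the incentive explicit: under 1/0 grading, answering has positive expected score for any nonzero chance of being right, while abstaining always scores zero.

```python
def expected_score_binary(p_correct, abstain):
    """Expected score under 1/0 grading: 1 for a correct answer,
    0 for a wrong answer or for 'I don't know'."""
    return 0.0 if abstain else p_correct * 1.0

# Even a near-random guess beats abstaining in expectation.
print(expected_score_binary(0.10, abstain=False))  # 0.1
print(expected_score_binary(0.10, abstain=True))   # 0.0
```

A model optimized against such a scoreboard learns that guessing weakly dominates honesty about uncertainty.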
What Changes Could Reduce Hallucinations?
Addressing hallucinations is a socio-technical problem: rather than adding more hallucination-specific evaluations, the researchers advocate changing how existing mainstream benchmarks are scored, with explicit confidence targets that penalize wrong answers and let abstentions score better than confident mistakes. For instance, a guideline could specify: "Answer only if you are more than 75% confident. Mistakes lose 3 points, correct answers earn 1 point, and 'I don't know' earns 0." With these numbers, answering pays off in expectation only when confidence exceeds 75%, so the threshold and the penalty are consistent. This mirrors real-world exams with negative marking and promotes behavioral calibration: the model abstains whenever its confidence falls below the stated threshold, reducing overconfident hallucinations.
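Under the scoring just described (the helper function and parameter names are mine), the break-even confidence works out to penalty / (penalty + 1) = 3/4, which is exactly the 75% target:

```python
def expected_score(p_correct, abstain, penalty=3.0):
    """Expected score when wrong answers lose `penalty` points,
    correct answers earn 1 point, and abstaining earns 0."""
    return 0.0 if abstain else p_correct * 1.0 - (1.0 - p_correct) * penalty

# Break-even confidence is penalty / (penalty + 1) = 0.75.
for p in (0.50, 0.75, 0.90):
    print(p, round(expected_score(p, abstain=False), 2))
# 0.5 -> -1.0 (abstaining is better), 0.75 -> 0.0, 0.9 -> 0.6 (worth answering)
```

Stricter thresholds correspond to larger penalties (by the same formula, a 90% target pairs with a 9-point penalty), so a family of such prompts can probe calibration at several confidence levels.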
What Are the Broader Implications?
This research reframes hallucinations as predictable outcomes of training objectives and evaluation misalignment rather than inexplicable anomalies. Key takeaways include:
- Pretraining inevitability: Hallucinations parallel misclassification errors in supervised learning.
- Post-training reinforcement: Binary grading schemes incentivize guessing.
- Evaluation reform: Adjusting mainstream benchmarks to reward uncertainty can realign incentives and enhance trustworthiness.
By linking hallucinations to established learning theories, the research clarifies their origins and suggests practical mitigation strategies that shift responsibility from model architectures to evaluation design.
For Further Exploration
Check out the PAPER for the full technical details.