Shapley NEAR – Norm-basEd Attention-wise usable infoRmation
Affiliation to be updated
Overview of Shapley NEAR: norm-based attention outputs are converted to entropy-based information gain across layers and heads, which is then fairly attributed to context sentences via Shapley values to assess hallucination risk.
Large language models can confidently generate incorrect answers, which is risky in safety-critical applications. Many existing hallucination detectors only use final-layer logits or post-hoc textual checks, ignoring the semantic structure encoded in intermediate attention blocks. Shapley NEAR addresses this by defining an entropy-based attribution framework over all layers and heads, grounded in V-usable information and Shapley value theory.
The method converts norm-based attention outputs into head- and layer-wise information gain, comparing entropy with and without context. This information is then decomposed into sentence-level contributions using Shapley values, producing a NEAR score that serves as a confidence signal for the model’s answer: high NEAR scores indicate that context genuinely reduced uncertainty, while low scores flag likely hallucinations.
Shapley NEAR further distinguishes between parametric hallucinations (the model’s pre-trained knowledge overriding context) and context-induced hallucinations (misleading context spuriously boosting confidence), and supports a test-time head clipping strategy to disable attention heads that consistently behave in a hallucination-prone, context-agnostic way.
Links to the latest version of the paper, public preprint, and the planned codebase.
Download Paper – Complete manuscript describing Shapley NEAR: definitions, theoretical properties, experimental setup, and extensive ablations.
View Archive – Once public, this link can point to the arXiv preprint or conference proceedings version of the work.
Code (coming soon) – Planned open-source release implementing NEAR, sentence-level Shapley attribution, and head clipping on top of popular LLMs.
Demo – This section can host a live NEAR demo (e.g., via Gradio or Streamlit) showing how NEAR flags hallucinated answers for question–context pairs.
A typical demo would let users input a context passage and question, query a selected LLM (such as Qwen2.5-3B, LLaMA3.1-8B, or OPT-6.7B), and then visualize the NEAR score alongside sentence-level attributions. Sentences with high contribution to information gain are highlighted.
Try the live demo here.
You can embed a live NEAR demo directly below.
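As a concrete starting point, here is a minimal Gradio skeleton for such a demo. The `run_near` wrapper is hypothetical: it stands in for the full NEAR pipeline (information gain plus sentence-level Shapley attribution) and must be wired to an actual model and the method sketched in the Method section before the demo works.

```python
import gradio as gr

# Illustrative Gradio skeleton for the demo described above. `run_near` is a
# hypothetical wrapper around the NEAR pipeline (information gain + sentence-level
# Shapley attribution) returning the overall score and per-sentence contributions.

def run_near(context: str, question: str):
    raise NotImplementedError("wire this to the NEAR pipeline from the Method section")

def demo_fn(context, question):
    score, per_sentence = run_near(context, question)        # [(sentence, shapley_value), ...]
    highlights = [(sent, f"{val:+.3f}") for sent, val in per_sentence]
    return f"NEAR score: {score:.3f}", highlights

demo = gr.Interface(
    fn=demo_fn,
    inputs=[gr.Textbox(label="Context", lines=8), gr.Textbox(label="Question")],
    outputs=[gr.Textbox(label="NEAR score"),
             gr.HighlightedText(label="Sentence-level attributions")],
    title="Shapley NEAR hallucination check",
)

if __name__ == "__main__":
    demo.launch()
```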
Why entropy-based, attention-wise usable information is needed for hallucination detection.
LLM hallucinations emerge when models output fluent but incorrect statements with high confidence, which is especially problematic in domains where factual accuracy is crucial. Token-level entropy signals and the semantic diversity of multiple generations can be useful, but extending them cleanly to sentence-level decisions in autoregressive models is non-trivial, and they typically overlook what happens inside the network.
Prior work on V-usable information and pointwise V-information showed that classical mutual information tends to overestimate how much signal a computationally bounded model can actually exploit. At the same time, analyses of transformer internals revealed that feed-forward layers often encode superficial correlations, while attention heads are more aligned with in-context reasoning.
Shapley NEAR brings these threads together: it focuses on attention outputs, measures how much they reduce entropy relative to a null context, and then attributes this usable information to individual context sentences. This gives an interpretable, plug-and-play signal that correlates with whether an answer is supported by the context or likely to be hallucinated.
From norm-based attention information to sentence-level Shapley NEAR scores and head clipping.
For each layer ℓ and head h, the model computes query, key, and value tensors over the concatenated context and question. The attention output for head (ℓ, h) is projected to the model dimension, and the vector at the final question token is extracted. Its norm is used as a proxy for information carried by that head, and a softmax over this vector defines a vocabulary distribution.
Entropy of this distribution is measured with and without the context (null input), and the difference defines an information gain for that head. Summing over all layers and heads yields a total information gain IG(x → q) capturing how much the context reduces predictive uncertainty for the question.
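To make this step concrete, the sketch below computes head-wise entropies and the resulting information gain for a LLaMA-style Hugging Face model. Several details are assumptions rather than the paper's implementation: per-head outputs are recovered by splitting the input of each attention block's o_proj, each head's projected vector is mapped through the LM head to obtain a vocabulary distribution, and the question alone serves as the null context.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch of head-wise information gain for a LLaMA-style decoder.
# Assumptions (not the paper's released code): per-head outputs are recovered by
# splitting the input of each attention block's o_proj, each head's contribution
# is mapped through the LM head to get a vocabulary distribution, and the "null
# context" is the question on its own.

def headwise_entropies(model, tokenizer, prompt):
    """Return a (num_layers, num_heads) tensor of entropies at the final token."""
    cfg = model.config
    n_heads = cfg.num_attention_heads
    head_dim = cfg.hidden_size // n_heads
    per_layer = {}

    def make_hook(layer_idx):
        def hook(module, inputs, output):
            x = inputs[0][:, -1, :]                       # concatenated head outputs, final token
            heads = x.view(-1, n_heads, head_dim)         # (1, H, d_head)
            w = module.weight.view(cfg.hidden_size, n_heads, head_dim)
            # project each head separately through its slice of W_O (model dimension)
            per_layer[layer_idx] = torch.einsum("bhe,dhe->bhd", heads, w)
        return hook

    handles = [layer.self_attn.o_proj.register_forward_hook(make_hook(i))
               for i, layer in enumerate(model.model.layers)]
    try:
        ids = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            model(**ids)
    finally:
        for h in handles:
            h.remove()

    entropies = []
    with torch.no_grad():
        for i in sorted(per_layer):
            logits = model.lm_head(per_layer[i])          # (1, H, vocab): per-head vocabulary logits
            p = F.softmax(logits.float(), dim=-1)
            entropies.append(-(p * p.clamp_min(1e-12).log()).sum(-1).squeeze(0))
    return torch.stack(entropies)                         # (num_layers, num_heads)

def information_gain(model, tokenizer, context, question):
    """IG(x -> q): total entropy reduction, summed over layers and heads, when context is present."""
    h_null = headwise_entropies(model, tokenizer, question)                  # null context
    h_ctx = headwise_entropies(model, tokenizer, f"{context}\n{question}")   # with context
    return (h_null - h_ctx).sum().item()

# Example usage (repo id is illustrative and may differ):
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16)
# ig = information_gain(model, tokenizer, context, question)
```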
The context is segmented into sentences, and the total information gain is viewed as a cooperative “game” between these segments. Using Shapley values, NEAR computes the average marginal contribution of each sentence across all permutations of the remaining sentences.
The Shapley NEAR score is the average of these sentence-level Shapley values. It is bounded by a function of the number of layers, heads, and vocabulary size, symmetric across sentences with identical contributions, and empirically monotone when more layers are included. An AME-style estimator with randomly sampled coalitions provides a practical approximation with proven error bounds.
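The attribution step can be sketched as a permutation-sampling Shapley estimator. This is in the spirit of the coalition-sampling approximation mentioned above rather than a reproduction of it; value_fn is any coalition value, for example the information_gain sketch above applied to a subset of context sentences.

```python
import random

def shapley_near(value_fn, sentences, question, num_permutations=32, seed=0):
    """Permutation-sampling estimate of sentence-level Shapley values and the NEAR score.

    value_fn(selected_sentences, question) -> scalar information gain of that coalition,
    e.g. information_gain(model, tokenizer, " ".join(selected_sentences), question).
    This is a sketch in the spirit of the sampled-coalition estimator, not the exact one.
    """
    rng = random.Random(seed)
    n = len(sentences)
    shapley = [0.0] * n
    for _ in range(num_permutations):
        order = rng.sample(range(n), n)                      # random permutation of sentence indices
        coalition = []
        prev_value = value_fn([], question)                  # empty coalition (null context)
        for idx in order:
            coalition.append(idx)
            members = [sentences[i] for i in sorted(coalition)]   # keep original sentence order
            value = value_fn(members, question)
            shapley[idx] += (value - prev_value) / num_permutations  # marginal contribution
            prev_value = value
    near = sum(shapley) / n                                   # NEAR score: mean sentence-level Shapley value
    return near, shapley
```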
NEAR also helps distinguish different failure modes. When a context sentence that does not contain the answer reduces information gain (negative Shapley value), it signals parametric hallucination: the model’s internal knowledge conflicts with the context and increases uncertainty. When such a sentence spuriously raises information gain (positive contribution), it indicates context-induced hallucination where misleading context overly boosts confidence.
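For analysis settings where gold answers are available, this distinction reduces to a simple sign test on the sentence-level Shapley values. The helper below is a toy illustration; the eps tolerance is a hypothetical knob, not a value from the paper.

```python
def sentence_failure_mode(shapley_value, supports_answer, eps=1e-3):
    """Label a context sentence's role from its Shapley value (analysis-time heuristic).

    supports_answer is only known when gold answers exist; eps is a hypothetical
    tolerance for treating small contributions as neutral.
    """
    if supports_answer:
        return "supporting evidence"
    if shapley_value < -eps:
        return "parametric-hallucination signal"        # internal knowledge conflicts with the context
    if shapley_value > eps:
        return "context-induced-hallucination signal"   # misleading context inflates confidence
    return "neutral"
```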
By tracking head-wise contributions, NEAR identifies attention heads that consistently show strongly negative information gain. Clipping these heads at test time improves hallucination detection and answer quality (e.g., higher AUROC, accuracy, and ROUGE-L on CoQA with LLaMA3.1-8B), demonstrating how an interpretability-based analysis can directly inform low-cost, training-free interventions.
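One lightweight way to realize such clipping with PyTorch forward pre-hooks is sketched below. Zeroing a head's slice of the o_proj input is one plausible reading of "disabling" a head; selecting which heads to clip (those with consistently strongly negative information gain) is left to the caller.

```python
def clip_heads(model, heads_to_clip):
    """Disable selected attention heads at test time by zeroing their output slice.

    heads_to_clip: iterable of (layer_idx, head_idx) pairs, e.g. heads whose head-wise
    information gain is consistently strongly negative. Returns hook handles; call
    .remove() on each handle to restore the original model.
    """
    cfg = model.config
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    by_layer = {}
    for layer_idx, head_idx in heads_to_clip:
        by_layer.setdefault(layer_idx, []).append(head_idx)

    def make_pre_hook(head_indices):
        def pre_hook(module, args):
            x = args[0].clone()                          # concatenated per-head outputs
            for h in head_indices:
                x[..., h * head_dim:(h + 1) * head_dim] = 0.0
            return (x,)                                  # replaces o_proj's input
        return pre_hook

    handles = []
    for layer_idx, head_indices in by_layer.items():
        o_proj = model.model.layers[layer_idx].self_attn.o_proj
        handles.append(o_proj.register_forward_pre_hook(make_pre_hook(head_indices)))
    return handles
```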
Performance on QA benchmarks, Shapley ablations, threshold analysis, and head clipping, along with concrete examples of NEAR on answerable and unanswerable questions.
Positive NEAR example: the question is answerable from the context. The NEAR visualization shows high scores on the supporting sentences, indicating strong, context-grounded usable information.
Negative NEAR example: the question is not answerable from the context. NEAR assigns low scores to all sentences, signalling a high risk of hallucination despite the model's fluent response.
NEAR is evaluated on CoQA, QuAC, SQuAD v2.0 (unanswerable subset), and TriviaQA (rc-nocontext) using Qwen2.5-3B, LLaMA3.1-8B, and OPT-6.7B. Metrics include AUROC, Kendall’s τ, and Pearson correlation between NEAR scores and ground-truth answerability labels.
Across all models and datasets, NEAR consistently outperforms strong baselines, often surpassing INSIDE by about 8–13% in AUROC and by 10–15% in rank and linear correlation metrics. The best scores are obtained on SQuAD, suggesting that when the underlying QA task is easier, NEAR’s attention-wise signal translates into particularly clean separation between confident, correct answers and hallucinations.
Additional experiments on larger models (e.g., LLaMA-3.1-70B and Phi-3-Medium-14B) and long-context benchmarks such as LongRA further support NEAR’s robustness and scalability (shown in the appendix).
Replacing the Shapley aggregation with a simple greedy ranking over sentence-level information gain reduces performance. With Shapley, AUROC, Kendall’s τ, and PCC on CoQA + LLaMA3.1-8B improve substantially (e.g., AUROC from ≈0.79 to ≈0.85 and τ from ≈0.51 to ≈0.66), showing that coalition-aware attribution is important for stable rankings.
Sweeping NEAR thresholds across quantiles reveals that the first quartile (Q1) gives the most reliable separation between answerable and hallucinated cases across all datasets and models, while thresholds near 0 or far into the upper tail hurt accuracy.
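In practice this amounts to calibrating a first-quartile cut-off on held-out NEAR scores and flagging answers that fall below it; the helper names below are illustrative only.

```python
import numpy as np

def q1_threshold(calibration_near_scores):
    """First-quartile NEAR threshold computed on a held-out calibration set."""
    return float(np.quantile(calibration_near_scores, 0.25))

def is_likely_hallucination(near_score, threshold):
    """Flag an answer when its NEAR score falls below the calibrated Q1 threshold."""
    return near_score < threshold
```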
Finally, clipping heads whose information gain falls below a strong negative threshold leads to additional gains. On CoQA with LLaMA3.1-8B, NEAR+Head Clipping improves AUROC and accuracy over both NEAR alone and INSIDE, and yields better ROUGE-L alignment between generated and reference answers, highlighting the practical usefulness of NEAR-driven structural interventions.
Add the final BibTeX entry here once the paper is public and de-anonymized.
@inproceedings{NEAR2025,
  title     = {Fact or Hallucination? An Entropy-Based Framework for Attention-Wise Usable Information in LLMs},
  author    = {Vishal Pramanik and Susmit Jha and Alvaro Velasquez and Sumit Kumar Jha},
  booktitle = {To be updated},
  year      = {2025}
}