University of Florida
Overview of HETA: semantic influence (target-conditioned attention–value rollout), curvature sensitivity (Hessian-based estimates), and information gain (KL divergence under token masking) combine to produce target-specific token attributions.
Decoder-only language models generate impressive text, but their decisions are difficult to interpret. Most existing attribution methods were designed for encoder-style architectures and rely on local, first-order approximations, which struggle to capture the causal and semantic structure of autoregressive generation. As a result, their explanations can be unstable and misaligned with true token influence.
HETA addresses this limitation with a token-level attribution framework tailored to decoder-only language models. It combines three complementary views of importance: (i) a semantic transition component that traces attention–value flows ending at the target token, enforcing a causal gate over tokens that can influence the prediction; (ii) a Hessian-based sensitivity term that captures second-order curvature of the target log-likelihood, revealing nonlinear and interaction effects beyond gradients; and (iii) an information-theoretic term that measures how the predictive distribution changes when individual tokens are masked. Together, these signals yield context-aware, causally grounded, and semantically meaningful attributions that outperform strong baselines on both benchmark datasets and curated evaluation setups.
Links to the current version of the paper and the planned open-source release.
Main manuscript describing the HETA framework, theoretical foundations, and experimental results.
Download Paper
This link will point to the official preprint or conference archive once it is public.
Archive link – coming soon
The official implementation of HETA, including evaluation pipelines for reproducing the reported experiments, will be released after camera-ready.
Code – coming soon
Live demo of HETA via Gradio, embedded into this page.
The demo allows users to provide an input sequence, choose a target token, and visualize token-level attributions. Tokens are coloured by their HETA score, with darker green indicating higher influence.
You can try the demo in the box alongside, or open the full demo page in a new tab here.
Why attention maps and first-order gradients are not enough for decoder-only language models.
Attention-based explanations show where a model “looks” but not necessarily what truly drives its predictions. Attention weights can be rearranged or perturbed without significantly changing the output, and aggregated attention across heads and layers often mixes direct and indirect influence in ways that are hard to interpret.
First-order methods such as plain gradients, Input×Gradient, or Integrated Gradients approximate influence by local linear sensitivity. In highly nonlinear regions or flat regimes of the activation function, gradients can vanish even when finite perturbations to a token still cause meaningful changes in the output distribution. In autoregressive models, where each token is generated conditioned on a long and context-dependent history, these issues become more severe.
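The gap between local gradients and finite perturbations can be seen in a toy example. The sketch below uses a saturated sigmoid over a single scalar feature (a stand-in for a token's contribution, not HETA's actual model): the local gradient is nearly zero, yet zeroing the feature, akin to masking the token, shifts the output substantially.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy scalar "model": a saturated sigmoid over one token feature x.
x = 4.0

# Local first-order sensitivity: d sigmoid / dx = s * (1 - s), tiny here.
s = sigmoid(x)
grad = s * (1.0 - s)

# Finite perturbation: zeroing the feature (akin to masking the token)
# still moves the output substantially.
delta = abs(sigmoid(x) - sigmoid(0.0))

print(f"gradient ~ {grad:.4f}, finite change = {delta:.4f}")
```

Here the gradient is about 0.018 while masking moves the output by about 0.48, so a purely first-order score would rank this feature as nearly irrelevant.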
HETA is motivated by the need for attributions that respect causal structure, capture higher-order effects, and reflect how the output distribution actually changes when context tokens are perturbed. The framework is designed to provide stable, faithful, and interpretable explanations across prompts, models, and decoding hyperparameters.
HETA decomposes token influence into semantic flow, curvature-based sensitivity, and information-theoretic impact.
HETA first traces attention–value flows that terminate at the target position under the decoder’s causal mask. This produces a semantic transition vector that assigns non-negative mass only to tokens that lie on valid paths to the target. It acts as a causal gate: only tokens that can structurally influence the target are eligible to receive attribution.
This component respects the temporal and structural constraints of autoregressive decoding and ensures that attributions are only assigned to context tokens that could have influenced the target via the network’s attention pathways.
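A minimal numpy sketch of this idea is shown below. It rolls averaged per-layer attention matrices forward under a lower-triangular causal mask (with the residual connection folded in, as in standard attention rollout) and reads off the flow mass arriving at the target position. The function name `causal_rollout` and the uniform toy attention are illustrative; HETA's actual semantic-transition component additionally incorporates value transformations.

```python
import numpy as np

def causal_rollout(attn_layers, target_pos):
    """Roll attention forward through layers under a causal mask and
    return the flow mass arriving at `target_pos`.

    attn_layers: list of (T, T) row-stochastic attention matrices
    (heads already averaged). Illustrative sketch only.
    """
    T = attn_layers[0].shape[0]
    causal = np.tril(np.ones((T, T)))           # causal gate: no future tokens
    rollout = np.eye(T)
    for A in attn_layers:
        A = A * causal                          # zero out non-causal entries
        A = A + np.eye(T)                       # fold in the residual stream
        A = A / A.sum(axis=-1, keepdims=True)   # re-normalize rows
        rollout = A @ rollout
    flow = rollout[target_pos]                  # mass reaching the target
    return flow / flow.sum()                    # non-negative, sums to 1

# Toy example: two layers of uniform causal attention over 4 tokens.
T = 4
uniform = np.tril(np.ones((T, T))) / np.arange(1, T + 1)[:, None]
scores = causal_rollout([uniform, uniform], target_pos=2)
print(scores)  # positions after the target receive exactly zero mass
```

Because the mask is applied before every layer's rollout step, tokens that appear after the target can never accumulate attribution mass, which is exactly the causal-gate behavior described above.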
To capture nonlinear effects, HETA estimates token-wise sensitivity from Hessian–vector products using a Hutchinson estimator, avoiding explicit construction of the full Hessian. In parallel, it measures how the target distribution changes when each token is masked, via KL divergence between the original and perturbed predictions.
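Both estimators can be sketched in a few lines. The snippet below shows (a) a Hutchinson-style estimate of the Hessian diagonal from Hessian-vector products with Rademacher probes, never forming the full Hessian, and (b) a KL divergence between an original and a token-masked predictive distribution. The quadratic objective with known curvature stands in for the target log-likelihood; in HETA the HVP would come from double backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

def hutchinson_diag(hvp, dim, n_samples=100):
    """Estimate diag(H) as E[v * (H v)] with Rademacher probes v.
    `hvp(v)` returns a Hessian-vector product; here it is supplied
    analytically for a toy quadratic rather than via double backprop."""
    est = np.zeros(dim)
    for _ in range(n_samples):
        v = rng.choice([-1.0, 1.0], size=dim)   # Rademacher probe
        est += v * hvp(v)
    return est / n_samples

# Toy objective with known curvature: H = diag(1, 2, 3).
H = np.diag([1.0, 2.0, 3.0])
diag_est = hutchinson_diag(lambda v: H @ v, dim=3)
print(diag_est)  # recovers [1, 2, 3]

def kl_divergence(p, q):
    """KL(p || q) between original and token-masked predictions."""
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])   # original next-token distribution
q = np.array([0.4, 0.4, 0.2])   # distribution after masking one token
print(kl_divergence(p, q))      # larger KL => more informative token
```

For a diagonal Hessian the Rademacher estimator is exact (since each probe entry squares to one); in general it converges as the number of probes grows.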
The final score multiplies the causal gate with a weighted combination of curvature and information terms, yielding a target-conditioned importance measure for each token that reflects both local and global influences on the prediction.
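Assuming per-token vectors for the causal gate, curvature term, and information term, the combination step can be sketched as below. The weighting parameter `alpha` and the final normalization are illustrative assumptions, not HETA's exact formulation.

```python
import numpy as np

def heta_score(gate, curvature, info_gain, alpha=0.5):
    """Combine the causal gate with a weighted mix of the curvature
    and information terms. `alpha` and the normalization are
    illustrative; the paper's exact weighting may differ."""
    combined = alpha * curvature + (1.0 - alpha) * info_gain
    score = gate * combined                 # gate zeroes non-causal tokens
    return score / score.sum()              # normalize over the context

gate      = np.array([1.0, 1.0, 1.0, 0.0])  # last token cannot reach target
curvature = np.array([0.2, 0.8, 0.4, 0.9])
info_gain = np.array([0.1, 0.6, 0.5, 0.3])
final = heta_score(gate, curvature, info_gain)
print(final)  # non-negative scores; the gated-out token scores zero
```

Note how the multiplicative gate guarantees that a token with high curvature or KL impact still receives zero attribution if it lies off every causal path to the target.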
The overall pipeline yields non-negative, target-specific attribution scores for every token in the context. By combining structural (causal paths), geometric (curvature), and information-theoretic (KL divergence) views, HETA provides explanations that are both theoretically grounded and empirically robust for decoder-only language models.
This design allows HETA to faithfully highlight tokens that are truly responsible for a given prediction, rather than merely correlating with attention weights or gradient magnitudes. As a result, it produces more stable and interpretable attribution maps across tasks and model scales.
Summary of experimental findings and an example visualization of HETA attributions.
Example of HETA token-level attribution: tokens are coloured by their importance for a chosen target position, with darker green indicating higher attribution.
On benchmark datasets such as LongRA, TellMeWhy, and WikiBio, HETA achieves higher Soft-NC and Soft-NS scores than gradient-based, attention rollout, and recent attribution baselines. The improvements are especially strong for long-range reasoning tasks, where capturing deep context dependencies is crucial.
These results indicate that HETA’s scores more accurately reflect which tokens truly matter for the target prediction, rather than simply tracking surface-level correlations or noisy gradient signals.
A curated evaluation set combining narrative and science QA passages is used to probe whether attribution mass concentrates on truly diagnostic evidence. Using the Dependent Sentence Attribution (DSA) metric, HETA substantially outperforms all baselines, while also showing higher stability under input noise, syntactic rephrasings, and changes in decoding hyperparameters.
Together, these findings suggest that HETA provides attributions that are more closely aligned with human judgments about which parts of the context support a given answer, and that its explanations are robust to small perturbations in the prompt or decoding setup.
Citation information will be posted here once the paper is publicly available.
BibTeX entry – coming soon.