Abstract

Overview of the HETA framework

Overview of HETA: semantic influence (target-conditioned attention–value rollout), curvature sensitivity (Hessian-based estimates), and information gain (KL divergence under token masking) combine to produce target-specific token attributions.

Decoder-only language models generate impressive text, but their decisions are difficult to interpret. Most existing attribution methods were designed for encoder-style architectures and rely on local, first-order approximations, which struggle to capture the causal and semantic structure of autoregressive generation. As a result, their explanations can be unstable and misaligned with true token influence.

HETA addresses this limitation with a token-level attribution framework tailored to decoder-only language models. It combines three complementary views of importance: (i) a semantic transition component that traces attention–value flows ending at the target token, enforcing a causal gate over tokens that can influence the prediction; (ii) a Hessian-based sensitivity term that captures second-order curvature of the target log-likelihood, revealing nonlinear and interaction effects beyond gradients; and (iii) an information-theoretic term that measures how the predictive distribution changes when individual tokens are masked. Together, these signals yield context-aware, causally grounded, and semantically meaningful attributions that outperform strong baselines on both benchmark datasets and curated evaluation setups.

Paper Archive & Code

Links to the current version of the paper and the planned open-source release.

Paper PDF

Main manuscript describing the HETA framework, theoretical foundations, and experimental results.

Download Paper
arXiv / Archive

This link will point to the official preprint or conference archive once it is public.

Archive link – coming soon
Code

The official implementation of HETA, including evaluation pipelines and scripts for reproducing the reported experiments, will be released after the camera-ready deadline.

Code – coming soon

Interactive Demo: Input / Output

A live Gradio demo of HETA, embedded in this page.

Try HETA Online

The demo allows users to provide an input sequence, choose a target token, and visualize token-level attributions. Tokens are coloured by their HETA score, with darker green indicating higher influence.

  • Type or paste a prompt for a decoder-only language model.
  • Select the index or position of the target token.
  • Inspect how HETA distributes attribution across the context.

You can try the demo in the embedded box alongside, or open the full demo page in a new tab here.
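
If you want to build a similar interface locally, a minimal Gradio app of the same shape could look like the sketch below. `heta_attribute` is a hypothetical placeholder for the real HETA pipeline and simply returns uniform scores, so the example is self-contained and runnable.

```python
import gradio as gr

def heta_attribute(prompt, target_index):
    # Hypothetical placeholder for the full HETA pipeline: whitespace
    # tokenization and uniform scores, just so the sketch runs end to end.
    tokens = prompt.split()
    scores = [1.0 / max(len(tokens), 1)] * len(tokens)
    return tokens, scores

def explain(prompt, target_index):
    tokens, scores = heta_attribute(prompt, int(target_index))
    # (token, score) pairs; HighlightedText colours tokens by numeric label.
    return list(zip(tokens, scores))

demo = gr.Interface(
    fn=explain,
    inputs=[gr.Textbox(label="Prompt"),
            gr.Number(label="Target token index", precision=0)],
    outputs=gr.HighlightedText(label="HETA attributions"),
)

if __name__ == "__main__":
    demo.launch()
```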

Embedded Demo

The live demo can be embedded directly below so visitors can interact with HETA without leaving this page.

Motivation

Why attention maps and first-order gradients are not enough for decoder-only language models.

Attention-based explanations show where a model “looks” but not necessarily what truly drives its predictions. Attention weights can be rearranged or perturbed without significantly changing the output, and aggregated attention across heads and layers often mixes direct and indirect influence in ways that are hard to interpret.

First-order methods such as plain gradients, Input×Gradient, or Integrated Gradients approximate influence by local linear sensitivity. In highly nonlinear regions, or in flat, saturated regimes of the activations, gradients can vanish even when finite perturbations to a token still cause meaningful changes in the output distribution. In autoregressive models, where each token is generated conditioned on a long and context-dependent history, these issues become more severe.

HETA is motivated by the need for attributions that respect causal structure, capture higher-order effects, and reflect how the output distribution actually changes when context tokens are perturbed. The framework is designed to provide stable, faithful, and interpretable explanations across prompts, models, and decoding hyperparameters.

Method Overview

HETA decomposes token influence into semantic flow, curvature-based sensitivity, and information-theoretic impact.

Semantic Transition Influence

HETA first traces attention–value flows that terminate at the target position under the decoder’s causal mask. This produces a semantic transition vector that assigns non-negative mass only to tokens that lie on valid paths to the target. It acts as a causal gate: only tokens that can structurally influence the target are eligible to receive attribution.

This component respects the temporal and structural constraints of autoregressive decoding and ensures that attributions are only assigned to context tokens that could have influenced the target via the network’s attention pathways.
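To make the gating concrete, here is a minimal sketch of a target-conditioned attention rollout in PyTorch. It assumes head-averaged attention matrices (e.g. from a Hugging Face model run with output_attentions=True) and omits the value weighting that the full semantic transition component uses; the function name is illustrative.

```python
import torch

def semantic_transition(attentions, target_pos):
    """Target-conditioned attention rollout under the causal mask.

    `attentions` is a list of head-averaged [T, T] attention matrices, one
    per layer. A sketch of the idea, not the paper's exact formulation.
    """
    T = attentions[0].shape[-1]
    eye = torch.eye(T, device=attentions[0].device)
    rollout = eye
    for attn in attentions:
        a = torch.tril(attn)                  # enforce the causal mask
        a = 0.5 * a + 0.5 * eye               # account for residual streams
        a = a / a.sum(dim=-1, keepdim=True)   # re-normalize each row
        rollout = a @ rollout                 # compose flow across layers
    # Row `target_pos` holds the mass each context token sends to the
    # target; positions after the target receive zero by construction.
    return rollout[target_pos]
```

Because products of lower-triangular matrices remain lower-triangular, tokens after the target can never receive attribution mass, which is exactly the causal gate described above.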

Curvature & Information Gain

To capture nonlinear effects, HETA estimates token-wise sensitivity from Hessian–vector products using a Hutchinson estimator, avoiding explicit construction of the full Hessian. In parallel, it measures how the target distribution changes when each token is masked, via KL divergence between the original and perturbed predictions.
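The sketch below illustrates both estimators under assumptions not fixed by the text: a Hugging Face-style causal LM called on input embeddings, a Rademacher-based Hutchinson estimate of the Hessian diagonal, and masking implemented by substituting a `mask_id` token. Function names and the per-token aggregation are illustrative.

```python
import torch
import torch.nn.functional as F

def curvature_scores(model, inputs_embeds, target_pos, target_id, n_samples=8):
    """Hutchinson estimate of per-token curvature of the target log-likelihood.

    `model` takes `inputs_embeds` [1, T, d] and returns logits [1, T, V];
    logits at position t-1 predict the token at position t, so `target_pos`
    is assumed to be >= 1. diag(H) is estimated as E[z * (H z)] with
    Rademacher vectors z, so the full Hessian is never built.
    """
    inputs_embeds = inputs_embeds.detach().requires_grad_(True)
    logits = model(inputs_embeds=inputs_embeds).logits
    logp = F.log_softmax(logits[0, target_pos - 1], dim=-1)[target_id]
    grad, = torch.autograd.grad(logp, inputs_embeds, create_graph=True)

    diag_est = torch.zeros_like(inputs_embeds)
    for _ in range(n_samples):
        z = torch.randint_like(inputs_embeds, 0, 2) * 2.0 - 1.0  # Rademacher
        hvp, = torch.autograd.grad(grad, inputs_embeds,
                                   grad_outputs=z, retain_graph=True)
        diag_est += z * hvp                                      # z * (H z)
    diag_est /= n_samples
    # Aggregate curvature magnitude over embedding dimensions per token.
    return diag_est.abs().sum(dim=-1).squeeze(0)

def information_gain(model, input_ids, target_pos, mask_id):
    """KL(original || perturbed) of the target distribution when each
    context token is replaced by `mask_id` (e.g. the tokenizer's unk/pad
    token -- an illustrative masking choice)."""
    T = input_ids.shape[1]
    scores = torch.zeros(T)
    with torch.no_grad():
        base = F.log_softmax(model(input_ids).logits[0, target_pos - 1], dim=-1)
        for i in range(target_pos):          # only causal predecessors matter
            perturbed = input_ids.clone()
            perturbed[0, i] = mask_id
            pert = F.log_softmax(
                model(perturbed).logits[0, target_pos - 1], dim=-1)
            scores[i] = F.kl_div(pert, base, log_target=True, reduction="sum")
    return scores
```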

The final score multiplies the causal gate with a weighted combination of curvature and information terms, yielding a target-conditioned importance measure for each token that reflects both local and global influences on the prediction.
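A minimal sketch of this combination, assuming all three signals are length-T vectors over the same context positions; the blending weight `alpha` and the sum-normalization are illustrative assumptions rather than the paper's exact choices.

```python
def heta_scores(gate, curvature, info_gain, alpha=0.5):
    """Final HETA-style attribution: causal gate times a blend of the
    curvature and information terms. `alpha` controls the blend."""
    def normalize(x):
        s = x.sum()
        return x / s if s > 0 else x
    blended = alpha * normalize(curvature) + (1 - alpha) * normalize(info_gain)
    return gate * blended
```

Because the gate, the curvature magnitudes, and the KL terms are all non-negative, the final scores are non-negative as well, matching the property noted below.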

Putting It All Together

The overall pipeline yields non-negative, target-specific attribution scores for every token in the context. By combining structural (causal paths), geometric (curvature), and information-theoretic (KL divergence) views, HETA provides explanations that are both theoretically grounded and empirically robust for decoder-only language models.

This design allows HETA to faithfully highlight tokens that are truly responsible for a given prediction, rather than merely correlating with attention weights or gradient magnitudes. As a result, it produces more stable and interpretable attribution maps across tasks and model scales.

Results

Summary of experimental findings and an example visualization of HETA attributions.

Example of HETA token-level attribution: tokens are coloured by their importance for a chosen target position, with darker green indicating higher attribution.

Attribution Faithfulness

On benchmark datasets such as LongRA, TellMeWhy, and WikiBio, HETA achieves higher Soft-NC and Soft-NS scores than gradient-based, attention rollout, and recent attribution baselines. The improvements are especially strong for long-range reasoning tasks, where capturing deep context dependencies is crucial.

These results indicate that HETA’s scores more accurately reflect which tokens truly matter for the target prediction, rather than simply tracking surface-level correlations or noisy gradient signals.

Alignment & Robustness

A curated evaluation set combining narrative and science QA passages is used to probe whether attribution mass concentrates on truly diagnostic evidence. Using the Dependent Sentence Attribution (DSA) metric, HETA substantially outperforms all baselines, while also showing higher stability under input noise, syntactic rephrasings, and changes in decoding hyperparameters.

Together, these findings suggest that HETA provides attributions that are more closely aligned with human judgments about which parts of the context support a given answer, and that its explanations are robust to small perturbations in the prompt or decoding setup.

BibTeX

Citation information will be posted here once the paper is publicly available.

BibTeX entry – coming soon.