Activation Steering via Generative Causal Mediation
Aruna Sankaranarayanan, Amir Zur, Atticus Geiger, Dylan Hadfield-Menell · Feb 17, 2026 · Citations: 0
Abstract
Where should we intervene in a language model (LM) to localize and control behaviors that are diffused across many tokens of a long-form response? We introduce Generative Causal Mediation (GCM), a procedure for selecting model components (e.g., attention heads) from contrastive long-form responses in order to steer such diffuse concepts (e.g., talking in verse vs. talking in prose). In GCM, we first construct a dataset of contrasting behavioral inputs and long-form responses. We then quantify how strongly each model component mediates the concept and select the strongest mediators for steering. We evaluate GCM on three behaviors (refusal, sycophancy, and style transfer) across three language models. GCM successfully localizes concepts expressed in long-form responses and outperforms correlational probe-based baselines when steering with a sparse set of attention heads. Together, these results demonstrate that GCM is an effective approach for localizing concepts in, and controlling, the long-form responses of LMs.
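The abstract's pipeline (contrastive dataset, mediation scoring, sparse head selection, steering) can be sketched on toy data. This is a minimal illustration, not the paper's implementation: the mediation score here is just the norm of the mean contrastive activation difference per head, a hedged stand-in for whatever causal mediation measure GCM actually uses, and all array shapes, names (`mediation_scores`, `select_mediators`, `steer`), and the additive steering rule are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: cached per-head activations for contrastive response pairs.
n_pairs, n_heads, d = 32, 8, 16
acts_pos = rng.normal(size=(n_pairs, n_heads, d))  # e.g. "verse" responses
acts_neg = rng.normal(size=(n_pairs, n_heads, d))  # e.g. "prose" responses
acts_neg[:, 2] += 1.5                              # plant the concept in head 2

def mediation_scores(pos, neg):
    """Score each head by how strongly it separates the contrast
    (a simple proxy, not the paper's actual mediation measure)."""
    diff = pos.mean(axis=0) - neg.mean(axis=0)     # (n_heads, d) mean difference
    return np.linalg.norm(diff, axis=-1)           # one scalar score per head

def select_mediators(pos, neg, k=1):
    """Keep only the k strongest mediating heads, for sparse steering."""
    return np.argsort(mediation_scores(pos, neg))[::-1][:k]

def steer(acts, pos, neg, heads, alpha=1.0):
    """Shift the selected heads' activations along the contrastive direction."""
    direction = pos.mean(axis=0) - neg.mean(axis=0)  # (n_heads, d)
    steered = acts.copy()
    steered[:, heads] += alpha * direction[heads]
    return steered

top = select_mediators(acts_pos, acts_neg, k=1)
print(top)  # the planted head 2 should dominate the contrast
steered = steer(acts_neg, acts_pos, acts_neg, top, alpha=1.0)
```

In a real setting the cached activations would come from hooking a transformer's attention heads over the contrastive dataset, and steering would add the direction to those heads during generation; the numpy arrays above stand in for that plumbing.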