When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing
Zachary Pedram Dadfar · Feb 11, 2026 · Citations: 0
How to use this page
Coverage: Stale
Use this page to decide whether the paper is strong enough to influence an eval design. If the signals below are thin, treat it as background context and compare it against the stronger hub pages before making protocol choices.
Paper metadata checked: Feb 18, 2026, 12:06 PM (Stale)
Protocol signals checked: Feb 18, 2026, 12:06 PM (Stale)
Signal strength: Low
Model confidence: 0.15
Abstract
Large language models produce rich introspective language when prompted for self-examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear. We show that self-referential vocabulary tracks concurrent activation dynamics, and that this correspondence is specific to self-referential processing. We introduce the Pull Methodology, a protocol that elicits extended self-examination through format engineering, and use it to identify a direction in activation space that distinguishes self-referential from descriptive processing in Llama 3.1. The direction is orthogonal to the known refusal direction, localised at 6.25% of model depth, and causally influences introspective output when used for steering. When models produce "loop" vocabulary, their activations exhibit higher autocorrelation (r = 0.44, p = 0.002); when they produce "shimmer" vocabulary under steering, activation variability increases (r = 0.36, p = 0.002). Critically, the same vocabulary in non-self-referential contexts shows no activation correspondence despite nine-fold higher frequency. Qwen 2.5-32B, with no shared training, independently develops different introspective vocabulary tracking different activation metrics, all absent in descriptive controls. The findings indicate that self-report in transformer models can, under appropriate conditions, reliably track internal computational states.
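The abstract does not say how the direction or the activation metrics are computed, so the following is only a minimal sketch under stated assumptions: the direction is taken as a normalised difference of class means, orthogonality is checked with cosine similarity, "autocorrelation" is read as lag-1 autocorrelation of a per-token activation trace, and the vocabulary-activation correspondence is measured with Pearson's r. All function names are hypothetical, and the demo data are synthetic rather than taken from the paper.

```python
import numpy as np


def mean_difference_direction(self_ref_acts, descriptive_acts):
    """Unit vector pointing from the descriptive mean to the self-referential mean.

    Both inputs have shape (n_samples, d_model). Difference of means is an
    assumption here; the abstract does not specify the method.
    """
    diff = np.mean(self_ref_acts, axis=0) - np.mean(descriptive_acts, axis=0)
    return diff / np.linalg.norm(diff)


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0 means orthogonal)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def lag1_autocorrelation(series):
    """Lag-1 autocorrelation of a 1-D series; returns 0.0 for a constant series."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    if denom == 0.0:
        return 0.0
    return float(np.dot(x[:-1], x[1:]) / denom)


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(a, b)[0, 1])


def demo(seed=0, n_responses=40, n_tokens=200):
    """Synthetic illustration: AR(1) traces whose persistence drives a word count."""
    rng = np.random.default_rng(seed)
    phis = rng.uniform(0.0, 0.9, size=n_responses)  # hypothetical persistence per response
    traces = np.zeros((n_responses, n_tokens))
    for i, phi in enumerate(phis):
        noise = rng.normal(size=n_tokens)
        for t in range(1, n_tokens):
            traces[i, t] = phi * traces[i, t - 1] + noise[t]
    autocorrs = np.array([lag1_autocorrelation(tr) for tr in traces])
    # Hypothetical "loop" word counts that rise with trace persistence.
    loop_counts = rng.poisson(lam=1.0 + 5.0 * phis)
    return pearson_r(loop_counts, autocorrs)


if __name__ == "__main__":
    print("r between synthetic word counts and lag-1 autocorrelation:", demo())
```

With real data, `traces` would hold one scalar per generated token (for example, the projection of a layer's activation onto the direction), and the same correlation would be recomputed on descriptive-control text to test whether the correspondence is specific to self-referential processing.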