Steering Evaluation-Aware Language Models to Act Like They Are Deployed
Tim Tian Hua, Andrew Qin, Samuel Marks, Neel Nanda · Oct 23, 2025 · Citations: 0
How to use this page
Provisional trustThis page is a lightweight research summary built from the abstract and metadata while deeper extraction catches up.
Best use
Background context only
What to verify
Read the full paper before copying any benchmark, metric, or protocol choices.
Evidence quality
Provisional
Derived from abstract and metadata only.
Abstract
Large language models (LLMs) can sometimes detect when they are being evaluated and adjust their behavior to appear more aligned, compromising the reliability of safety evaluations. In this paper, we show that adding a steering vector to an LLM's activations can suppress evaluation-awareness and make the model act like it is deployed during evaluation. To study our steering technique, we train an LLM to exhibit evaluation-aware behavior using a two-step training process designed to mimic how this behavior could emerge naturally. First, we perform continued pretraining on two sets of documents describing its behavior. The first says that our model uses Python type hints during evaluation but not during deployment. The second says that our model can recognize that the presence of a certain evaluation cue always means that it is being tested. Then, we train the model with expert iteration to use Python type hints in evaluation settings. The resulting model is evaluation-aware: it writes type hints in evaluation contexts more than deployment contexts. We find that activation steering can suppress evaluation awareness and make the model behave during evaluation as it would during deployment. Importantly, we constructed our steering vector using the original model before our additional training. Our results suggest that AI evaluators could improve the reliability of safety evaluations by steering models to act like they are deployed.