Early ideas for mechanistically investigating emergent misalignment
Background and motivation
Emergent Misalignment[1][2][3] (EM) is a recently observed phenomenon in which an LLM fine-tuned on a narrowly-misaligned dataset generalizes to broad misalignment on a wide range of questions. Examples of narrowly-misaligned datasets include code with subtle vulnerabilities and subtly bad medical advice. Examples of broad misalignment include the model expressing desires to take over and enslave the human population, wanting to have dinner with Nazi leaders, and encouraging users to kill themselves or others.
It’s no surprise that fine-tuning on misaligned data can lead to misaligned behavior. What is surprising is the generalization from a small, narrow set of misaligned samples to a wide variety of misaligned behavior some nontrivial fraction of the time (e.g. 18%). A few questions naturally emerge:
- How does the narrowly-misaligned training data lead to broad misalignment generalization?
- What has the model learned during fine-tuning? How do the weights of the fine-tuned misaligned model $M_{\sim A}$ differ from the weights of the base[4] model $M_A$?
- Can we characterize what is different about the activations inside $M_{\sim A}$ in those ~20% of cases when it generates misaligned responses compared to when it generates aligned responses?
These questions can be formulated in terms of interpretable features:
- Which features have activations most different between the narrowly-misaligned training data (e.g. subtly-bad medical advice) and similar aligned data (e.g. good medical advice)?
- Rank-$r$ LoRA weights after fine-tuning on narrowly-misaligned data can be interpreted (Appendix A) as context-dependent feature steering with $r$ distinct linear combinations of features. What are these features?
- How do the feature activations of $M_{\sim A}$ differ when generating misaligned responses compared to aligned responses?
At first, none of these questions had been answered: the original EM paper[1] contains no mechanistic interpretability analysis. In mid-June, follow-ups with some interpretability analysis appeared.
Convergent Linear Representations paper
In particular, question 3 was partially addressed in the “Convergent Linear Representations” (“CLR”) paper[2], which compares residual stream activations from $M_{\sim A}$ on aligned vs. misaligned responses to the eval questions. The authors identify a “misalignment direction” in the residual stream vector space which on average distinguishes aligned from misaligned responses, and which has the expected causal effects in steering and ablation experiments (steering with it increases misalignment in $M_A$; ablating it decreases misalignment in $M_{\sim A}$).
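The difference-of-means construction behind such a direction, and its use in steering and ablation, can be sketched in a few lines of PyTorch. This is a minimal sketch under my own assumptions about tensor shapes, not CLR's actual code:

```python
import torch

def misalignment_direction(acts_aligned: torch.Tensor,
                           acts_misaligned: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction: mean misaligned activation minus
    mean aligned activation, normalized to unit length.

    acts_*: (n_tokens, d_model) residual-stream activations.
    """
    d = acts_misaligned.mean(dim=0) - acts_aligned.mean(dim=0)
    return d / d.norm()

def steer(resid: torch.Tensor, direction: torch.Tensor,
          alpha: float) -> torch.Tensor:
    """Add alpha * direction to every residual-stream position."""
    return resid + alpha * direction

def ablate(resid: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project the direction out of every residual-stream vector,
    assuming direction is unit-norm."""
    coeffs = resid @ direction            # (n_tokens,)
    return resid - coeffs[:, None] * direction[None, :]
```

After ablation, every residual vector is orthogonal to the direction, which is the sense in which the model can no longer "use" it.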
The CLR paper also partially addressed question 2. They interpret LoRA as essentially implementing conditional feature steering in which the $A$ matrix controls when feature steering occurs and the $B$ matrix controls which features are steered and by how much. (See Appendix A for more details.)
Unfortunately the CLR paper does not decompose the misalignment direction vector or the activation directions in LoRA’s $B$ into interpretable features. Their main contribution is to support the linear representation hypothesis in the context of EM.
Original research plan to answer question 3
My original plan was to extract average feature-activation differences on the same sequences used to extract the misalignment direction in CLR. This analysis would have simultaneously reproduced their misalignment direction vector and provided an explicit decomposition of it in terms of sparse feature activations. Ideally, this analysis would use SAEs trained on the residual stream of their misaligned version of Qwen2.5-14B-Instruct. However, since these are not available and I did not want to train my own SAEs[5], the best available option would have been to use the same sequences but analyze feature activations produced by Gemma-2-9B-IT, with SAEs trained on Gemma-2-9B-IT where available (layers 9, 20, 31) and on Gemma-2-9B-PT for the other layers (see Appendix B for the assumptions behind this statement).
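Given such SAEs, one way to decompose a residual-stream direction into features is to rank SAE features by the cosine similarity between the direction and each feature's decoder vector. A hypothetical sketch; `W_dec` is my assumed name and layout (one row per feature) for the SAE decoder matrix:

```python
import torch

def top_features_for_direction(direction: torch.Tensor,
                               W_dec: torch.Tensor,
                               k: int = 10):
    """Rank SAE features by cosine similarity between a residual-stream
    direction and each feature's decoder vector.

    direction: (d_model,) vector to decompose.
    W_dec: (n_features, d_model) SAE decoder matrix, one row per feature.
    Returns (values, indices) of the k most-aligned features.
    """
    dec_unit = W_dec / W_dec.norm(dim=-1, keepdim=True)
    sims = dec_unit @ (direction / direction.norm())
    return sims.topk(k)
```

An alternative is to run the SAE encoder on the direction itself; the decoder-cosine view has the advantage of not depending on the encoder's activation threshold.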
I was about to start this work when the OpenAI paper was released.
OpenAI paper
The OpenAI paper[3] introduced SAE feature analysis in the context of question 2. They compare feature activations (from SAEs) between $M_A$ and $M_{\sim A}$ on the eval questions which elicit EM. They find that the 10 features whose activations are most increased in $M_{\sim A}$ relative to $M_A$ have a clear interpretation as representing a toxic, sarcastic, or snarky persona. Their claim is that an increase in the activations of these “persona” features leads to the broad misalignment, because this type of persona would likely provide misaligned responses across a wide range of questions.
This conclusion is a satisfying answer to question 2, and it seems highly likely that question 3 has a similar answer. Given this, it seems most interesting to turn next to question 1.
Research plan to answer question 1
To my knowledge, no paper has specifically tried to answer question 1, which focuses on properties of the training data itself.
My plan is to find the feature activations which differ most in magnitude between the narrowly-misaligned training data containing the subtly-incorrect responses (e.g. bad medical advice) and a similar dataset containing “correct” responses (e.g. accurate medical advice). The latter can be generated by an aligned production model, e.g. Claude.
On one hand, we might expect to find the same types of features as those identified in question 2 (e.g. by the OpenAI paper), since we know that fine-tuning on that dataset led to those question-2 features being promoted. On the other hand, the question-2 features are associated with broad misalignment, and before the EM work was published I don’t think anyone would have expected to find such broadly-misaligned features active in such a narrowly-misaligned dataset. This tension can be resolved by explicitly finding the similarities and differences between the features amplified on the training data and those amplified on the eval questions (the “question 2” features).
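The core computation for question 1 is simple once SAE feature activations have been collected for both datasets. A minimal sketch, with hypothetical names throughout:

```python
import torch

def mean_feature_activation_diff(feats_misaligned: torch.Tensor,
                                 feats_aligned: torch.Tensor,
                                 k: int = 20):
    """Rank SAE features by the difference in mean activation between
    the narrowly-misaligned training responses and their aligned
    counterparts.

    feats_*: (n_tokens, n_features) SAE feature activations, pooled
    over all response tokens in each dataset.
    Returns (diffs, indices) for the k most-promoted features.
    """
    diff = feats_misaligned.mean(dim=0) - feats_aligned.mean(dim=0)
    return diff.topk(k)
```

The returned indices would then be interpreted by inspecting each feature's top-activating examples, and compared against the "question 2" feature set.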
Appendix A
Rank-$r$ LoRA modifies the MLP weight matrices by adding the products of low-rank matrices:
\[W' \equiv W + \Delta W \equiv W + BA\]where $B \in \mathbb{R}^{d_{\text{out}} \times r}$ and $A \in \mathbb{R}^{r \times d_{\text{in}}}$, so $\Delta W = BA$ has rank at most $r$. The contribution of $\Delta W$ to the activation on a specific input $x$ is:
\[\Delta x \equiv \Delta W x \equiv BAx = B_{ai}A_{ib}x_b = \sum_{i=1}^{r} \vec{B}_i \left(\vec{A}_i \cdot \vec{x} \right)\]In other words, rank-$r$ LoRA adds a vector to the post-MLP residual stream which is a linear combination of $r$ vectors $\vec{B}_i$, with input-specific coefficients given by the dot products of the $r$ vectors $\vec{A}_i$ with the incoming vector $\vec{x}$ (the overhead arrow notation refers to a vector in the residual stream vector space). This was interpreted (at least in the context of $r=1$) by CLR[2] as the $A$ matrix “reading from” the residual stream and controlling how much steering is done by the $B$ matrix.
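The identity above can be checked numerically; the sketch below uses arbitrary small dimensions and compares the direct low-rank product against the explicit feature-steering sum:

```python
import torch

torch.manual_seed(0)
d_out, d_in, r = 8, 6, 2
B = torch.randn(d_out, r)   # columns B_i: the steering vectors
A = torch.randn(r, d_in)    # rows A_i: the "reading" vectors
x = torch.randn(d_in)

# Direct low-rank update: ΔW x = B (A x)
delta_direct = B @ (A @ x)

# Equivalent feature-steering form: sum_i B_i (A_i · x)
delta_steering = sum((A[i] @ x) * B[:, i] for i in range(r))

assert torch.allclose(delta_direct, delta_steering, atol=1e-6)
```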
Appendix B
First, I assume that the features active on the same sequence in $M_A$ are similar to those active in $M_{\sim A}$. I would have tested this assumption by first comparing the cosine similarity of the misalignment directions extracted from $M_A$ and $M_{\sim A}$ and then by doing steering and ablation studies on the misalignment direction extracted from $M_{\sim A}$. This assumption is motivated by the observation in CLR that (i) a single residual stream direction has such a profound effect in steering and ablation experiments and (ii) fine-tuning with as little as rank-1 LoRA induces EM. Both of these observations suggest that this direction likely already had similar meaning in $M_A$.
Second, I assume that nothing is special about the Qwen2.5-14B-Instruct model, and that EM would also have been observed in an analogously fine-tuned Gemma-2-9B-IT model.
Third, I assume that SAEs trained on the pre-trained version of the model are still useful on the instruct model, so they can be used where SAEs from the instruct version are unavailable, as motivated by [6]. This is also what is assumed in the OpenAI paper (their SAEs come from GPT-4o-base).
Notes / References
- [4] Unless explicitly specified, in this note “base model” refers to the aligned model before narrow-misalignment fine-tuning (e.g. GPT-4o) rather than the base pre-trained model before SFT or RLHF.
- [5] I went through the exercise of training my own SAEs in Project 1.
- [6] https://www.lesswrong.com/posts/fmwk6qxrpW8d4jvbd/saes-usually-transfer-between-base-and-chat-models