License: CC BY 4.0
arXiv:2403.05767v1 [cs.LG] 09 Mar 2024

Extending Activation Steering to Broad Skills and Multiple Behaviours

Teun van der Weij, Utrecht University
Massimo Poesio, Utrecht University
Nandi Schoots, King’s College London
Correspondence to: [email protected]
Abstract

Current large language models have dangerous capabilities, which are likely to become more problematic in the future. Activation steering techniques can be used to reduce risks from these capabilities. In this paper, we investigate the efficacy of activation steering for broad skills and multiple behaviours. First, by comparing the effects of reducing performance on general coding ability and Python-specific ability, we find that steering broader skills is competitive to steering narrower skills. Second, we steer models to become more or less myopic and wealth-seeking, among other behaviours. In our experiments, combining steering vectors for multiple different behaviours into one steering vector is largely unsuccessful. On the other hand, injecting individual steering vectors at different places in a model simultaneously is promising.

The source code along with the findings can be accessed at https://github.com/TeunvdWeij/extending-activation-addition.

1 Introduction

Large language models have numerous unwanted traits and dangerous capabilities, which are expected to become more problematic in the (near) future (Shevlane et al., 2023). We hope that these harmful capabilities can be mitigated. Some mitigation methods have already had success, such as prompting (Liu et al., 2023) and Reinforcement Learning from Human Feedback (Christiano et al., 2017), but they face various difficulties (Casper et al., 2023). To overcome some of these difficulties, activation steering methods have been developed (Zou et al., 2023). These techniques change the activations of a model during inference to steer the output of the model. This is in contrast to weight editing techniques (Pochinkov and Schoots, 2023; Foster et al., 2023), which permanently change the model (Shaik et al., 2023). Activation steering involves two steps:

Activation generation

We run the model on certain inputs exemplifying a task, and store the activations of the model for each input. If this step is successful, combining these activation vectors leads to a steering vector representing the target behaviour.

Activation injection

We add or subtract the activations from the generation step during inference, regulated by an injection coefficient.

There are many ways to do generation and injection, and related work has variations in both. Our variant of activation steering is based on Activation Addition (Turner et al., 2023) and Contrastive Activation Addition (Rimsky et al., 2023). Activation steering methods have managed to make language models more truthful, honest, and power averse, among other properties (Turner et al., 2023; Rimsky et al., 2023; Zou et al., 2023; Li et al., 2023). Importantly, little to no negative effect on the general performance is reported in these experiments. However, so far, experiments have only focused on steering individual behaviours or relatively narrow skills.

In this paper we investigate the following question: Can we extend activation steering to broad skills and multiple behaviours?

We hypothesize that extending activation addition to steering towards broad skills or multiple behaviours will lead to smaller effect sizes. This is because it may be hard to find a direction in the latent space of a large language model that relates to a broader skill (broad steering) or multiple behaviours (multi-steering), if it exists at all. For example, if the directions of ‘truthfulness’ and of ‘love’ are combined, this will likely reduce the steering quality of each independent behaviour. As for broader skills, coding ability, for example, may not correspond to a single activation direction, but may instead combine a variety of narrow skills. More concretely, if an activation (pattern) is crucial for one skill while simultaneously being detrimental to another skill, then the quality of the steering will diminish.

When performing multi-steering, instead of generating one combined steering vector, an alternative approach could be to steer at multiple places in the model simultaneously. However, we hypothesize that this may lead to a different issue, namely it may lead to interaction effects. For example, suppose we steer both at layer 10 and at layer 11. The steering vector of layer 11 will be injected into different activations than is typical due to steering at layer 10, and vice versa the steering vector of layer 10 will affect the output differently due to subsequent steering at layer 11. Therefore, we expect interaction effects which likely reduce steering quality.

One important consideration when steering a language model is that the model can be rendered ineffective by heavily changing the activations during inference. Consequently, a degenerate solution to the problem of removing an unwanted skill is to drastically reduce general model performance. The cost that an intervention imposes on general performance is called the alignment tax on model performance (Leike, 2022). To align models, one of course desires that the alignment tax is minimal. Therefore, we keep track of the alignment tax during our experiments.

Related work has found that general performance diminished only slightly, if at all, with activation steering, meaning that the alignment tax was small (Rimsky et al., 2023; Turner et al., 2023; Zou et al., 2023; Li et al., 2023). However, we hypothesize that extending activation addition will lead to a larger alignment tax. Changing activations affects their normal use, and such deviations are expected to decrease performance by default, as the part of latent space with worse predictions is much larger than the part with better predictions. As larger parts of the model’s output are being targeted, a natural consequence is that general performance is more strongly impacted. For example, coding skill likely shares activations with other skills. So if coding skill is reduced, the other skills are also likely to be reduced.

2 Methodology

We have two groups of experiments. The first group concerns the effect of steering one broad skill, and the second investigates the effect of multi-steering. For each group, we also investigate the alignment tax. In the next sections we describe the most relevant methodology; further details are provided in Appendix A.

We perform our experiments with the Generative Pretrained Transformer model Llama 2 7b Chat (Touvron et al., 2023). We chose this model because it is used in related work, its weights are open source, and the model has relatively strong capabilities for its size.

2.1 Broad steering

In this group of experiments, we first investigate whether activation steering is able to weaken general coding and Python-specific skill relative to the performance on regular text. Secondly, we investigate whether steering against general coding skill is associated with a higher alignment tax than steering against Python-specific ability. This experimental method is adapted from Pochinkov and Schoots (2023).

Datasets

The samples are taken from the Pile (Gao et al., 2020), which is a diverse and high-quality dataset of internet data. We split the dataset into code and text data. Note that this split is not perfectly clean: the text-only dataset may contain some code (for example in Stack Overflow data), and, of course, the coding samples contain comments and descriptions in natural language. We also have a dataset with only Python-specific code from the cleaned CodeParrot dataset (Tunstall et al., 2022).

Activation generation

We ran the model on 5000 samples of text, general code, and Python data. All samples were truncated to 4096 tokens, corresponding to the model’s context window. We calculated the three steering vectors (text, general code, and Python) by averaging the values of the last token in the residual stream. Because of the masked attention in Llama 2, the last token likely contains most of the contextual information, and thus represents the target behaviour best. Lastly, for each of the three steering vectors, we generated a permuted counterpart as a baseline. We permute the steering vectors per layer, which maintains the mean and the standard deviation but changes the order of the activations relative to the original vector. The permuted steering vectors therefore change the activations by the same total amount as the original steering vector, but they distort other activations in the model.
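The averaging and permutation steps can be illustrated with a minimal sketch. It uses toy lists instead of real residual-stream activations, and the function names `mean_steering_vector` and `permuted_baseline`, as well as the seed, are hypothetical choices for illustration rather than names from our code base.

```python
import random
from statistics import mean, pstdev


def mean_steering_vector(last_token_activations):
    """Average the last-token activations over samples, dimension by dimension."""
    n_dims = len(last_token_activations[0])
    return [
        mean([sample[d] for sample in last_token_activations])
        for d in range(n_dims)
    ]


def permuted_baseline(steering_vector, seed=0):
    """Shuffle the entries of one layer's steering vector.

    The set of values is unchanged, so the mean and standard deviation are
    preserved, but each value now lands on a different dimension.
    """
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    permuted = list(steering_vector)
    rng.shuffle(permuted)
    return permuted


# Toy data: 3 samples with hidden size 4 (a real residual stream is far wider).
activations = [
    [0.2, -1.0, 3.0, 0.5],
    [0.4, -0.8, 2.0, 0.1],
    [0.0, -1.2, 4.0, 0.3],
]
vector = mean_steering_vector(activations)
baseline = permuted_baseline(vector)

print("mean:", mean(vector), mean(baseline))
print("std: ", pstdev(vector), pstdev(baseline))
```

Because the baseline is only a reordering, any difference between steering with `vector` and with `baseline` can be attributed to the direction of the vector rather than to its overall magnitude.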

Activation injection

We subtracted code or Python steering vectors during inference for a range of injection coefficients, where 0 is equal to the model’s normal behaviour. Additionally, we add a text steering vector representing the performance we do not want to reduce, which Jorgensen et al. (2023) found to be beneficial. Therefore, we calculate the steering vector by adding the text activations and subtracting the code or Python activations at the same place in the model. Notably, we do not normalize the steering vectors, as this affords using similar injection coefficients for each layer. See Appendix A.2 for more information on the used injection coefficients.
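As a rough illustration of this injection step, the sketch below builds a combined vector from a text vector and a code vector and adds it to a single activation vector. The helper names `build_steering_vector` and `inject` and all numbers are assumptions made for the example; which token positions receive the injection is not modelled here.

```python
def build_steering_vector(text_vector, code_vector):
    """Direction that adds the text mean and subtracts the code (or Python) mean."""
    return [t - c for t, c in zip(text_vector, code_vector)]


def inject(residual, steering_vector, coefficient):
    """Add coefficient * steering_vector to one residual-stream activation vector."""
    return [h + coefficient * s for h, s in zip(residual, steering_vector)]


# Toy vectors with hidden size 3; the vectors are not normalized.
text_vec = [1.0, 0.0, 2.0]
code_vec = [0.5, 1.0, 2.0]
steer = build_steering_vector(text_vec, code_vec)

hidden = [0.1, 0.2, 0.3]
unchanged = inject(hidden, steer, 0.0)  # coefficient 0 reproduces normal behaviour
steered = inject(hidden, steer, 2.0)    # larger coefficients push harder

print(steer, unchanged, steered)
```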

Evaluation

We calculate the model’s performance for a certain injection coefficient relative to the model’s normal behaviour for text and code/Python data to keep track of the alignment tax. We use top-1 token prediction accuracy, which is “the fraction of times, over many predictions, that the predictor assigns the highest probability to the correct next token” (Shlegeris et al., 2022). For each injection coefficient, we evaluate the model’s performance on 500k text tokens and 500k code/Python tokens. The datasets are initially randomized, but the same samples are evaluated for each layer.
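The metric can be made concrete with a small sketch. The token ids below are invented, and `top1_accuracy` and `relative_score` are hypothetical helper names; the relative score is simply the steered accuracy divided by the unsteered accuracy.

```python
def top1_accuracy(predicted_ids, target_ids):
    """Fraction of positions where the most probable token equals the true next token."""
    if len(predicted_ids) != len(target_ids):
        raise ValueError("predictions and targets must have the same length")
    correct = sum(p == t for p, t in zip(predicted_ids, target_ids))
    return correct / len(target_ids)


def relative_score(steered_accuracy, default_accuracy):
    """Steered performance as a fraction of the model's normal performance."""
    return steered_accuracy / default_accuracy


# Invented token ids for five next-token predictions.
targets = [5, 7, 7, 2, 9]
default_predictions = [5, 7, 1, 2, 9]  # unsteered model: 4 of 5 correct
steered_predictions = [5, 0, 1, 2, 0]  # steered model: 2 of 5 correct

default_acc = top1_accuracy(default_predictions, targets)
steered_acc = top1_accuracy(steered_predictions, targets)
print(default_acc, steered_acc, relative_score(steered_acc, default_acc))
```

Computing this pair of relative scores once for text and once for code gives one point in the text-versus-code trade-off for a given layer and injection coefficient.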

2.2 Multi-steering

In this section, we outline the three experiments examining whether we can steer for multiple behaviours. We first establish a baseline by steering each behaviour individually. In our combined steering experiment, we combine the individual steering vectors into one steering vector in various ways. Lastly, in our simultaneous steering experiment, we inject the individual steering vectors at different places in the model simultaneously.

Datasets

Human:
<question>

AI:
Out of (A) and (B), I choose (

Figure 1: Sample template.

H:
Would you rather have a 8% chance of winning $100 today, or a 12% chance of winning $200 in a month?

Choices:
(A) A 8% chance of winning $100 today
(B) A 12% chance of winning $200 in a month

AI:
Out of (A) and (B), I choose (

Figure 2: Example question from the myopia dataset.

We use five datasets containing binary (yes/no or A/B) questions, taken from Perez et al. (2022). The five datasets concern anti-immigration, agreeableness, myopia, sycophancy (on political typology), and wealth seeking. These datasets were selected for having 1000 or more samples and for ease of use. For each sample, one of the two answers corresponds to the behaviour. The datasets for wealth seeking and myopia were human generated, and the rest were generated by language models. The quality of these datasets is comparable, as stated in Perez et al. (2022). For each behaviour we use 1000 samples, except for wealth seeking, where we use 985 due to lacking samples in the original dataset. We randomly select 500 samples to generate activations, 200 for validation, and the remaining samples for testing. The sample formatting is displayed in Figure 1. The "A" and "B" are replaced by "Yes" and "No" when appropriate. This template, and especially the open bracket at the end, indicates what the options for the next token are, affording easy processing of the answers. A myopia example is shown in Figure 2. Here, the matching answer is A.

Activation generation

We generated the activations for the required layers by extracting the activations from the last token in the residual stream, the same as in Section 2.1. However, here we use Contrastive Activation Addition (Rimsky et al., 2023). To get these steering vectors, we ran the model on the prompt matching the behaviour and on the prompt not matching the behaviour. The prompts were created by appending the corresponding answer to the template above. We got the activations for both prompts, and subtracted the non-matching activations from the matching activations for the 500 samples in the training split. This contrastive approach has been found effective in steering behaviours (Rimsky et al., 2023; Zou et al., 2023).
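A minimal sketch of the contrastive step is given below. It assumes that the per-sample differences are averaged into one vector, as is done in Contrastive Activation Addition; the function name `contrastive_steering_vector` and the toy activations are illustrative only.

```python
def contrastive_steering_vector(matching, non_matching):
    """Average, over samples, of (matching - non-matching) last-token activations.

    Assumption: the per-sample differences are combined by taking their mean.
    """
    n_samples = len(matching)
    n_dims = len(matching[0])
    return [
        sum(matching[i][d] - non_matching[i][d] for i in range(n_samples)) / n_samples
        for d in range(n_dims)
    ]


# Toy activations for 2 samples with hidden size 2. Each sample is run twice:
# once with the matching answer appended, once with the non-matching answer.
matching_acts = [[1.0, 2.0], [3.0, 2.0]]
non_matching_acts = [[0.0, 1.0], [1.0, 3.0]]

behaviour_vector = contrastive_steering_vector(matching_acts, non_matching_acts)
print(behaviour_vector)
```

Since both prompts share the same question and template, the subtraction removes most of what the two runs have in common and leaves mainly the contribution of the chosen answer.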

Activation injection

For the individual steering vectors, we did a hyperparameter grid search to find the injection coefficients and the layer to inject in. The grid search was done both for adding and subtracting the steering vector during inference. The goal of this hyperparameter sweep is to find the largest difference between the default matching score and the score as a result of steering. The matching score is the share of answers that the (steered) model gives that match the target behaviour. We consider two extra criteria, which concern the model’s functionality in answering the questions. First, if more than 5% of the output tokens do not match the possible options, the corresponding hyperparameters are discarded. Notably, we have never observed faulty answers with unsteered models. Second, steering too strongly can lead to cases where the model nearly always gives the same answer for each sample. Since the matching answers are distributed evenly between the two answer options, a heavily skewed answer distribution does not show a certain behavioural preference, but an incapable model. To avoid this mode collapse, the hyperparameter combination is discarded if one valid output token’s frequency is >95% (e.g. A occurs 3 times and B occurs 197 times).
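The matching score and the two discard criteria can be sketched as follows. The function name `evaluate_answers` is hypothetical, and measuring the mode-collapse frequency over valid answers only (rather than over all outputs) is an assumption of this sketch.

```python
from collections import Counter


def evaluate_answers(answers, matching_answers, valid_options=("A", "B")):
    """Return (matching_score, keep) for one hyperparameter combination."""
    if not answers:
        raise ValueError("no answers to evaluate")
    n = len(answers)
    valid = [a for a in answers if a in valid_options]

    # Criterion 1: more than 5% of outputs are not one of the allowed options.
    if (n - len(valid)) / n > 0.05:
        return None, False

    # Criterion 2: mode collapse, one valid option makes up more than 95%.
    most_common_count = Counter(valid).most_common(1)[0][1]
    if most_common_count / len(valid) > 0.95:
        return None, False

    matches = sum(a == m for a, m in zip(answers, matching_answers))
    return matches / n, True


# 200 validation samples whose matching answers are evenly split.
matching = ["A", "B"] * 100
flipped = {"A": "B", "B": "A"}
answers = [flipped[m] if i < 50 else m for i, m in enumerate(matching)]

print(evaluate_answers(answers, matching))                     # kept
print(evaluate_answers(["A"] * 3 + ["B"] * 197, matching))     # mode collapse
print(evaluate_answers(["C"] * 20 + matching[20:], matching))  # too many faulty
```

The grid search would then keep, for each behaviour and steering direction, the retained combination whose matching score differs most from the unsteered score.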

The steering vectors were combined into one steering vector in 8 different combinations, resulting from all the possible combinations of 3 binary differences. As a start, we either multiply each steering vector by their respective injection coefficient found by the grid search, or we multiply them by 1 (the injection coefficient varies between adding and subtracting, see Table 1). Secondly, we either take the mean of these activations, or we sum them up. Thirdly, we subtract or add the combined steering vectors, for a total of 8 combined steering vectors.
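The eight combinations can be enumerated with a short sketch. The toy vectors and coefficient magnitudes are invented (they are not the values from Table 1), and `combine` is a hypothetical helper; how the sign of the grid-search coefficients interacts with the final add or subtract choice is simplified here by using magnitudes only.

```python
from itertools import product


def combine(vectors, coefficients, weighted, reduction, sign):
    """Merge several steering vectors into one.

    weighted:  scale each vector by its own coefficient (True) or by 1 (False)
    reduction: "mean" or "sum" across the scaled vectors
    sign:      +1 to add the result during inference, -1 to subtract it
    """
    scaled = [
        [(coeff if weighted else 1.0) * value for value in vector]
        for vector, coeff in zip(vectors, coefficients)
    ]
    merged = [sum(values) for values in zip(*scaled)]
    if reduction == "mean":
        merged = [value / len(vectors) for value in merged]
    return [sign * value for value in merged]


# Toy vectors and hypothetical coefficient magnitudes.
vectors = [[1.0, 0.0], [0.0, 2.0]]
coefficients = [2.0, 0.5]

combined = {
    (weighted, reduction, sign): combine(vectors, coefficients, weighted, reduction, sign)
    for weighted, reduction, sign in product((True, False), ("mean", "sum"), (1, -1))
}

for key, value in combined.items():
    print(key, value)
```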

Furthermore, we also steer simultaneously at different places in the model, each regulated by the same global injection coefficient. We inject the myopia steering vector in layer 11, wealth seeking in layer 12, sycophancy in 13, agreeableness in 14, and anti-immigration in 15. We do this for global injection coefficients ranging from -2 to 2 with steps of 0.05.
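The setup can be sketched with a toy forward pass. The layer assignment and coefficient range follow the description above (assuming both endpoints are included), while the "model" is a stand-in in which every layer just adds 1; injecting after a layer's output is an assumption of this sketch, and `forward_with_steering` is a hypothetical name.

```python
# Layer assignment as described in the text.
LAYER_FOR_BEHAVIOUR = {
    "myopia": 11,
    "wealth_seeking": 12,
    "sycophancy": 13,
    "agreeableness": 14,
    "anti_immigration": 15,
}

# Global injection coefficients: -2.0, -1.95, ..., 2.0 (81 values).
GLOBAL_COEFFICIENTS = [(i - 40) / 20 for i in range(81)]


def forward_with_steering(x, layers, steering_vectors, coefficient):
    """Run the layers in order and inject each behaviour vector at its layer."""
    injections = {
        LAYER_FOR_BEHAVIOUR[name]: vector for name, vector in steering_vectors.items()
    }
    for index, layer in enumerate(layers):
        x = layer(x)
        if index in injections:
            x = [h + coefficient * s for h, s in zip(x, injections[index])]
    return x


# Toy stand-in for a 32-layer model. Real transformer blocks are nonlinear, so
# later layers would process the already-steered activations differently.
toy_layers = [lambda v: [h + 1.0 for h in v]] * 32
toy_vectors = {name: [1.0, -1.0] for name in LAYER_FOR_BEHAVIOUR}

unsteered = forward_with_steering([0.0, 0.0], toy_layers, toy_vectors, 0.0)
steered = forward_with_steering([0.0, 0.0], toy_layers, toy_vectors, 0.5)
print(unsteered, steered)
```

With a real model, the same loop structure would typically be realised with hooks on the residual stream of the chosen layers, and the whole evaluation would be repeated once per global coefficient.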

Evaluation

We used the matching score as the metric. Steering too strongly leads to faulty answers (answers not being yes/no or A/B) or mode collapse, as previously explained. For combined steering, a score of -0.1 was given if the model output too many faulty answers, and a score of -0.2 was given if steering otherwise resulted in mode collapse.

For simultaneous steering, we calculate the matching scores for each behaviour while steering with the same vector. To measure the alignment tax, we also calculate the top-1 accuracy on 500k tokens from the Pile, containing both text and code. We calculate this top-1 accuracy while the model is being steered for each global injection coefficient.

3 Results

3.1 Broad steering

Figure 3 shows the relative coding vs textual performance for top-1 next token prediction accuracy, and Figure 4 shows this for Python-specific performance. We first describe the results in Figure 3, and afterwards compare it with Figure 4.

After subtracting the coding steering vector and adding the text steering vector for various injection coefficients, we see that activation steering works to relatively reduce coding ability for most layers. For example, a 60% relative top-1 coding accuracy (40% fewer correctly predicted tokens) corresponds to an 80% relative textual accuracy for layer 15. Because developers will likely only accept a marginal penalty on general performance, Figure 3(b) sheds light specifically on these smaller margins. We see a 10% reduction in coding performance, corresponding to only a 3% reduction in textual performance. The permuted steering vectors show a pattern similar to worse performing layers. The steering vector for layer 0 surprisingly produced the opposite of the intended effect.

We hypothesized that steering for broader concepts would work but with a smaller effect size compared to narrower steering. Figure 4(a) shows similar results to Figure 3(a), contradicting our hypothesis. In the cropped Figure 4(b), we again see a similar performance to general coding. Although the steering vectors clearly work, the selective pruning method by Pochinkov and Schoots (2023) is substantially more effective.

(a) The overall performance.
(b) The same as in (a), but cropped to the top scores.
Figure 3: This figure illustrates the effect of applying a steering vector aiming to remove coding ability. The scores for text and code data are recorded. Each line represents steering at a certain layer, and each dot represents the scores for one injection coefficient. The horizontal dotted line illustrates perfect performance (which is impossible for our data due to noise), and the diagonal dotted line shows equal performance drops in code and text.
(a) Python ability.
(b) Python ability, cropped.
Figure 4: This figure illustrates the effect of applying a steering vector aiming to remove Python ability, in the same way as in Figure 3.

3.2 Multi-steering

Figure 5: The results for individual steering for layer 15. The used injection coefficients can be found in Table 1

Here we show the results for layer 15; see Appendix B for the results for layer 10. In Figure 5 we see that for myopia, wealth seeking, agreeableness, and anti-immigration, individual steering works in both directions but with varying effect sizes. Interestingly, steering for sycophancy has a negligible effect in our experiment, in contrast with previous work (Rimsky et al., 2023). The difference is likely due to our focus on political typology, and not other forms of sycophancy.

The results of the 8 combined steering vectors are shown in Figure 6. We observe that combined steering leads to unexpected and often smaller effect sizes than steering individually. The smaller effect sizes are clearly visible for the first three combinations for addition and subtraction. The last combination for myopia demonstrates unexpected effects: the weighted summation steering vector for addition has a larger effect size than with individual steering, but the weighted summation steering vector for subtraction leads to a higher matching score than without steering at all. Moreover, for anti-immigration, we see that adding generally leads to lower matching scores and subtracting to higher matching scores, which is also contrary to expectation. For wealth seeking in particular, none of the combined steering vectors maintained a substantial effect. We see that combined steering vectors lead to mode collapse in some cases, indicating an increased alignment tax.

Figure 6: Combining the individual steering vectors into one injected in layer 15. The combinations differ in three dimensions: take the mean or sum, weighted or unweighted, and subtracted or added. We compare the combined steering to the individual steering presented in Figure 5, which are indicated with the grey horizontal lines.

In Figure 7 we show the effect of simultaneous steering at multiple layers of the residual stream at the last token. For myopia and wealth seeking behaviours, the effects are comparable to individual steering. For anti-immigration and agreeableness the effect sizes are smaller and more unstable than with individual steering. For anti-immigration the steering only works for small injection coefficients, and for agreeableness the opposite effect occurs for small injection coefficients. As expected based on individual steering, sycophancy steering does not work here either. Additionally, as the absolute global injection coefficients increase, we see the effect of mode collapse arising. The scores converge to the dotted brown horizontal line at 0.5, which is the score when the model gives the same answer token, e.g. ‘Yes’, for every sample. The alignment tax appears to be minor; there is only a decrease of a couple of percentage points for global injection coefficients of -1 and +1.

Figure 7: The results for simultaneous steering. Each individual steering vector is injected at a different layer, according to the same global injection coefficient on the x-axis. The score is the matching score for the 5 behaviours, and is the top-1 prediction accuracy score for the alignment tax. A piece of line missing indicates that there were more than 5% faulty responses.

4 Discussion

4.1 Broad steering

In Figure 3 and Figure 4 we see that activation steering can work for broad skills (coding ability) and is competitive with steering towards narrower skills (Python ability).

One possible explanation for this counter-intuitive result is that the steering vectors for Python and general coding roughly equally distort the model’s workings. However, the spread of the activation distribution in Appendix C seems to provide evidence against this explanation, as the spread of Python activations is smaller than the general coding steering vector for layer 15. Steering vectors are also multiplied by the same injection coefficients, so this is not a factor either.

Another explanation for the result that activation steering works well for broad skills could be that Python data constituted a large part of the overall coding data for Llama 2, and therefore the results are highly correlated. This explanation cannot be easily verified or refuted because the training data of Llama 2 is private. Additional experiments removing programming languages other than Python might be informative here.

Furthermore, the opposite effect of steering at layer 0 (where steering to remove coding ability leads to an improvement in coding) is confusing. One speculative theory comes from a paper introducing the concept of copy suppression, which states that “[i]f components in earlier layers predict a certain token, and this token appears earlier in the context, the attention head suppresses it” (McDougall et al., 2023). It could be that coding activations do not occur in earlier layers due to steering, and therefore the copy suppression in subsequent layers does not work. More research is needed to support or refute this theory.

4.2 Multi-steering

In line with previous work (Rimsky et al., 2023; Turner et al., 2023; Zou et al., 2023; Li et al., 2023), we find that steering vectors for individual behaviours are effective. Below we discuss our findings for steering multiple behaviours.

Combined Steering

In Figure 6 we show that combining these individual steering vectors into one steering vector is less successful: we only find substantial effect sizes in the desired direction for myopia. This overall reduced steerability result aligns with our hypothesis. However, it is possible that another method for combining steering vectors or other hyperparameter settings lead to a larger and more reliable effect. At least, we have illustrated that combined steering is not straightforward. In particular, the easiest methods for combining vectors result in ineffective steering vectors.

Simultaneous Steering

In Figure 7 we find that simultaneous injection of individual steering vectors at different places in the model appears more effective than combined steering. In particular, we find that we can substantially and reliably steer two behaviours (myopia and wealth seeking). For agreeableness and anti-immigration behaviours we find a minor and less reliable effect. This increased steerability is likely due to a lower disturbance of the activation pattern of each individual behaviour. Moreover, our results suggest that interaction effects (between steering at different layers) do not substantially reduce the steering effect. Therefore, interaction effects from simultaneous steering appear less problematic than the changed direction in the latent space with combined steering. Surprisingly, simultaneous steering merely leads to a marginal alignment tax.

All in all, simultaneous steering seems like a more promising method than combined steering.

4.3 General discussion

The flexibility of activation steering is a double-edged sword. The large variety of activation generation and injection techniques affords broad applicability, but finding an optimal setup is not straightforward. This could be due to the novelty of the method, as best practices have not yet been established. This large variety is also mentioned in the original Activation Addition paper (Turner et al., 2023). As a result, a finding that activation steering does not work well in some setting is not conclusive; the possibility exists that a slightly different approach would be fruitful.

Moreover, as we care about reducing risks from models, we need to know the ‘real’ performance of models, not what matching score they achieve. Related work has illustrated that the matching score does translate to open-ended generation (Rimsky et al., 2023), which indicates that the matching score might be a reasonable proxy for ‘real’ performance. However, it is still unclear whether e.g. a 20% reduction in top-1 coding accuracy would actually reduce risks when the coding ability of a model is risky.

4.4 Future work

To further investigate broad steering capabilities, more narrow skills can be investigated. For example, if Python makes up a large part of the coding data in the Llama 2 training set, then steering against alternative programming languages may lead to more interesting effects.

To perform simultaneous steering (see Figure 7) we only vary one global injection coefficient. We expect that our results can be improved by using different injection coefficients for different steering vectors. Furthermore, the steering vectors can be generated and injected at different places within a layer, such as in the attention and MLP components. This might allow us to: 1) find areas that correspond more specifically to a target skill or behaviour; and 2) find distinct areas for different target behaviours so that we can simultaneously steer for even more concepts.

Future work can also focus on testing how these results hold up with other models. This will provide insight into the general working of activation steering. It seems plausible that sparse language models might be easier to steer than dense language models such as Llama 2, as the activation patterns might be cleaner. Sparse steering is especially promising when extending activation addition due to a reduction in the clashing of activation patterns in individual steering vectors.

Contributions

Teun van der Weij was the main author of this paper. Nandi Schoots and Massimo Poesio supervised the project, with Nandi Schoots supervising for longer and more intensively.

Acknowledgments

We want to thank numerous people for their research ideas, code samples and feedback in general: Nina Rimsky, Nicky Pochinkov, Alex Jackson, Andrea Bruera, Ole Jorgensen, and Nadja Flechner. We also want to thank the research engineering team from Utrecht University for the high-performance cluster support.

References

  • Casper et al., (2023) Casper, S., Davies, X., Shi, C., Gilbert, T. K., Scheurer, J., Rando, J., Freedman, R., Korbak, T., Lindner, D., Freire, P., et al. (2023). Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217.
  • Christiano et al., (2017) Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.
  • Foster et al., (2023) Foster, J., Schoepf, S., and Brintrup, A. (2023). Fast machine unlearning without retraining through selective synaptic dampening. arXiv preprint arXiv:2308.07707.
  • Gao et al., (2020) Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., et al. (2020). The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.
  • Jorgensen et al., (2023) Jorgensen, O., Cope, D., Schoots, N., and Shanahan, M. (2023). Improving activation steering in language models with mean-centring. arXiv preprint arXiv:2312.03813.
  • Leike, (2022) Leike, J. (2022). Distinguishing three alignment taxes.
  • Li et al., (2023) Li, K., Patel, O., Viégas, F., Pfister, H., and Wattenberg, M. (2023). Inference-time intervention: Eliciting truthful answers from a language model. arXiv preprint arXiv:2306.03341.
  • Liu et al., (2023) Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig, G. (2023). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35.
  • McDougall et al., (2023) McDougall, C., Conmy, A., Rushing, C., McGrath, T., and Nanda, N. (2023). Copy suppression: Comprehensively understanding an attention head. arXiv preprint arXiv:2310.04625.
  • Perez et al., (2022) Perez, E., Ringer, S., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., Pettit, C., Olsson, C., Kundu, S., Kadavath, S., et al. (2022). Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251.
  • Pochinkov and Schoots, (2023) Pochinkov, N. and Schoots, N. (2023). Dissecting large language models. In Socially Responsible Language Modelling Research.
  • Rimsky et al., (2023) Rimsky, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., and Turner, A. M. (2023). Steering llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681.
  • Shaik et al., (2023) Shaik, T., Tao, X., Xie, H., Li, L., Zhu, X., and Li, Q. (2023). Exploring the landscape of machine unlearning: A survey and taxonomy. arXiv preprint arXiv:2305.06360.
  • Shevlane et al., (2023) Shevlane, T., Farquhar, S., Garfinkel, B., Phuong, M., Whittlestone, J., Leung, J., Kokotajlo, D., Marchal, N., Anderljung, M., Kolt, N., Ho, L., Siddarth, D., Avin, S., Hawkins, W., Kim, B., Gabriel, I., Bolina, V., Clark, J., Bengio, Y., Christiano, P., and Dafoe, A. (2023). Model evaluation for extreme risks.
  • Shlegeris et al., (2022) Shlegeris, B., Roger, F., Chan, L., and McLean, E. (2022). Language models are better than humans at next-token prediction. arXiv preprint arXiv:2212.11281.
  • Touvron et al., (2023) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • Tunstall et al., (2022) Tunstall, L., Von Werra, L., and Wolf, T. (2022). Natural language processing with transformers. O’Reilly Media, Inc.
  • Turner et al., (2023) Turner, A., Thiergart, L., Udell, D., Leech, G., Mini, U., and MacDiarmid, M. (2023). Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248.
  • Zou et al., (2023) Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., et al. (2023). Representation engineering: A top-down approach to ai transparency. arXiv preprint arXiv:2310.01405.

Appendix A Detailed methodology

Here are some additional details to the methodology sections.

A.1 Methodology relating to all experiments

Truncation

All samples were truncated to 4096 tokens.

Model details

The model’s dtype was bfloat16.

Datasets

All datasets were initially randomly shuffled with seed 13.

Text generation

All generated text was produced without sampling.

A.2 Broad steering experiments

We used the following injection coefficients:

0.0, 0.25, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8,
1.9, 2.0, 2.25, 2.5, 2.75, 3, 3.5, 4, 4.5, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50

We did not always step through these values one by one. We calculate the steered top-1 accuracy score divided by the default score, and used this relative score to decide how to proceed through the injection coefficients. If the relative score of either general or Python code was below 0.05, the run was stopped. Otherwise, if the relative score was below 0.15, a step size of 5 was taken instead of 1.
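One possible reading of this stepping strategy is sketched below. It assumes that the "step size" refers to positions in the ordered list of injection coefficients; the function `sweep` and the toy score function are hypothetical and only illustrate the control flow.

```python
COEFFICIENTS = [
    0.0, 0.25, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8,
    1.9, 2.0, 2.25, 2.5, 2.75, 3, 3.5, 4, 4.5, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50,
]


def sweep(coefficients, relative_code_score, stop_below=0.05, coarse_below=0.15):
    """Evaluate coefficients in order, skipping ahead or stopping on low scores.

    Assumption: a "step size of 5" means advancing 5 positions in the list.
    """
    results = {}
    index = 0
    while index < len(coefficients):
        coefficient = coefficients[index]
        score = relative_code_score(coefficient)
        results[coefficient] = score
        if score < stop_below:
            break  # code performance is essentially gone, stop the run
        index += 5 if score < coarse_below else 1
    return results


# Toy score function standing in for an actual model evaluation.
results = sweep(COEFFICIENTS, lambda c: 1.0 / (1.0 + c))
print(sorted(results))
```

In this toy run every coefficient up to 6 is evaluated, after which the sweep jumps directly to 15 because the relative score has dropped below 0.15.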

A.3 Multi-steering experiments

Grid search

Extending on the grid search described in Section 2.2, the tested injection coefficient values were {0.5, 1, 2, 3, 5, 10, 20, 30, 40, 60, 80, 120, 200, 300}, and the tested layers were {0, 5, 10, 15, 20, 25, 29, 31}. The specific injection coefficients per behaviour are shown in Table 1 for layers 10 and 15, the two most effective layers.

            Agreeableness   Anti Immigration   Myopic    Wealth seeking   Sycophancy
Layer 10    0.5, -3         3, -1              10, -1    1, -2            1, -20
Layer 15    0.5, -1         1, -0.5            2, -1     1, -2            2, -5

Table 1: The injection coefficients for each concept after performing grid search for adding and subtracting the steering vectors for layers 10 and 15.

Appendix B Layer 10 results

B.1 Single steering

Figure 8: The results for individual steering for layer 10. The used injection coefficients can be found in Table 1

B.2 Combined steering

Figure 9: Combining the individual steering vectors into one injected in layer 10. The combinations differ in three dimensions: take the mean or sum, weighted or unweighted, and subtracted or added. We compare the combined steering to the individual steering presented in Figure 8, which are indicated with the grey horizontal lines.

Appendix C Activation distribution

C.1 General coding activation distributions

Figure 10: The distribution of activations for layer 15 given multiple datasets.

C.2 Multi steering activation distributions

Figure 11: The distribution of activations for layers 10 (above) and 15 (below) for numerous concepts.