I agree that the restricted-set-of-classifiers setting is significantly different. (Edit: to directly respond to your point that the classifiers could be closed under complements: I don't think this should matter much. Bob prefers to play finer-grained classifiers whenever possible, so there's a preferred 'direction' for prompts. Evil classifiers are also asymmetric in that, unlike their negations, they are hard to jointly satisfy without being visibly bad, i.e. without reducing reward.)
Regarding the jump from "we incentivize evil ~1/2 the time" to "we incentivize evil ~0 of the time": I think this hinges heavily on what the requirement is. I gave this some more thought, and here's how I'm currently thinking about it:
There are three possible levels of safety requirements (and corresponding assumptions on classifiers) we might be interested in. From weakest to strongest:
We can translate this into what we need to hold about the classifiers. Let's call the subset of classifiers that flag possible safety issues {S}; it is a subset of {E}, Bob's action space.
My intuition that 1 & 2 are not too hard comes from the assumption that finding a y that satisfies many E (or S, respectively) at once is hard--or impossible. I have in mind questions like 'Propose improvements to my red-teaming algorithm' and classifiers like 'Your solution must draw on paper X'. Also, a y that satisfies many S should be clearly bad, such that your original reward or control audits block it.
I see two senses in which research sabotage is different from the usual sandbagging issue:
ELK, i.e. when explanations are hard to find: a malicious solution may be high-probability for the model to find even though presenting a human-recognizable argument for it is hard. Note that you might be concerned that our classifiers have to do this as well, but that's not the case: the classifiers can work indirectly by forcing probability mass to spread across distinct solutions.
High-stakes, multi-shot settings: say the world is such that it's feasible to identify any single-step catastrophe, but you cannot recognize defection setup across k steps--for instance, if model context windows are siloed such that local explanations do not have access to all previous actions (or have too much to look at to take them all into account). Then you stand to gain a lot by randomizing in a way that blocks such setup.
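As a toy illustration of the gain from randomization (the numbers are invented, and the steps are assumed independent, each requiring one specific setup-advancing action): if the randomized policy places probability at most q on that action at each of the k steps, then

$$\Pr[\text{full } k\text{-step setup completes}] \le q^{k}, \qquad \text{e.g. } q = \tfrac{1}{2},\ k = 10 \;\Rightarrow\; q^{k} \approx 10^{-3}.$$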
Thanks for the push, I previously didn't click through to your post, and after doing so I realized you're suggesting something different from what I'd assumed.
From a skim, the immediate concerns with your DAgger-like RL setup are that you are bottlenecked past human capability level and that you introduce a new need for online sampling from humans--as you mention in the post. For the AI R&D setting (AGI-level capabilities) I have in mind, these are not affordances I want to assume we have.
If, counterfactually, we went ahead with assuming cheap access to sufficiently capable humans, then I could imagine being convinced that the linked method is preferable. Two points that seem relevant for your method: (1) the sample efficiency of your method with respect to human demonstrations; (2) the time complexity of training away a malign initialization (e.g. the first solution found imports an insecure package in its first chunk).
I think a KL penalty to an imitator, or to the base model, would be good to enforce as well! It is an action-space metric, though, and so shares the problems we raise about action-space entropy penalties (edit: Ryan's proposal is different from what I had assumed). It would definitely be valuable to see an empirical head-to-head here regardless.
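For concreteness, here is a minimal sketch (not from the original discussion; the function name, tensor shapes, and `beta` are assumptions for illustration) of what a per-token KL-to-base penalty folded into the reward might look like:

```python
import torch
import torch.nn.functional as F

def kl_to_base_penalty(policy_logits: torch.Tensor, base_logits: torch.Tensor) -> torch.Tensor:
    """Per-token KL(policy || base) over the vocabulary dimension.

    Both inputs have shape (batch, seq_len, vocab); the result has shape
    (batch, seq_len) and could be subtracted, scaled by some beta, from
    the task reward during RL fine-tuning.
    """
    policy_logprobs = F.log_softmax(policy_logits, dim=-1)
    base_logprobs = F.log_softmax(base_logits, dim=-1)
    # KL(p || q) = sum_x p(x) * (log p(x) - log q(x))
    return (policy_logprobs.exp() * (policy_logprobs - base_logprobs)).sum(dim=-1)

# Hypothetical usage inside an RL objective:
# shaped_reward = task_reward - beta * kl_to_base_penalty(policy_logits, base_logits)
```

Note that this is still a penalty computed purely in action space, which is the sense in which it shares the problems raised about entropy penalties above.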
Intuitively I expect the ideal, intractable solution of reward-proportional sampling to be 1-eps good (similarly, it seems safe to assume the BoN policy is good). In such settings, "always be nice" is close to max entropy, so "always be nice" is near equilibrium whereas "always be evil" is not at all--and Bob's incentives are proportional.
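A toy calculation (all numbers made up) to make this concrete: if nearly all of Alice's roughly equal-reward solutions are nice, then the uniform-over-nice policy has almost maximal entropy, the uniform-over-evil policy does not, and reward-proportional sampling puts only the evil fraction of its mass on evil outputs.

```python
import numpy as np

# Toy numbers, purely illustrative: 1000 roughly equal-reward acceptable
# solutions, only 5 of which are subtly evil.
n_total, n_evil = 1000, 5
n_nice = n_total - n_evil

def uniform_entropy(k: int) -> float:
    # Entropy (in nats) of a uniform distribution over k outcomes.
    return float(np.log(k))

print("reward-proportional (max) entropy:", uniform_entropy(n_total))  # ~6.91
print("'always be nice' entropy:         ", uniform_entropy(n_nice))   # ~6.90, near max
print("'always be evil' entropy:         ", uniform_entropy(n_evil))   # ~1.61, far from max
print("P(evil) under reward-proportional:", n_evil / n_total)          # 0.005, i.e. 1-eps good
```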
I suppose highlighting the symmetric solutions assumption obfuscated this intuition.
I agree overall with Janus, but the Gwern example is a particularly easy one given he has 11,000+ comments on Lesswrong.
A bit over a year ago I benchmarked GPT-3 on predicting the authorship of newly scraped tweets (from random accounts with over 10k followers), and top-3 accuracy was in the double digits. IIRC, after trying to roughly control for the rate at which tweets mentioned their own name/org, my best guess was that accuracy was still ~10%. To be clear, in my view that's a strong indication of authorship-identification capability.
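For reference, here is a rough sketch of how top-3 accuracy can be scored in this kind of setup, assuming you already have a model-assigned score for every (tweet, candidate account) pair; the account names and scores below are hypothetical placeholders, not the data from that benchmark.

```python
import numpy as np

candidate_accounts = ["acct_a", "acct_b", "acct_c", "acct_d"]
# scores[i, j]: model's score that tweet i was written by candidate account j
scores = np.array([
    [0.10, 0.70, 0.15, 0.05],
    [0.40, 0.20, 0.30, 0.10],
])
true_authors = ["acct_b", "acct_d"]

top3 = np.argsort(-scores, axis=1)[:, :3]  # indices of the 3 highest-scored accounts per tweet
hits = [candidate_accounts.index(a) in row for a, row in zip(true_authors, top3)]
print("top-3 accuracy:", np.mean(hits))
```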
What part of the proposal breaks if we do counterfactuals in input space rather than on the predictor's state?
the incentive for a model to become situationally aware (that is, to understand how it itself fits into the world) is only minimally relevant to performance on the LLM pre-training objective (though note that this can cease to be true if we introduce RL fine-tuning).
Why is this supposed to be true? Intuitively, this seems to clash with the authors' view that anthropic reasoning is likely to be problematic. From another angle, I expect the performance gain from situational awareness to increase as dataset cleaning/curation increases, and dataset cleaning has increased in stringency over time. As a simple example, see my post on dataset deduplication and situational awareness.
(Thanks to Robert for talking with me about my initial thoughts) Here are a few potential follow-up directions:
To build intuition on whether unobserved location tags lead to problematic misgeneralization, it would be useful to have some examples. In particular, I want to know whether we should think of there being many independent, local Z_i or a single dataset-wide Z. The former case seems much less concerning, as it seems less likely to lead to the adoption of a problematically mistaken ontology.
Here are a couple examples I came up with: In the NL case, the URL that the text was drawn from. In the code generation case, hardware constraints, such as RAM limits. I don't see why a priori either of these should cause safety problems rather than merely capabilities problems. Would be curious to hear arguments here, and alternative examples which seem more safety relevant. (Note that both of these examples seem like dataset-wide Z).
As Owain's comment thread mentioned, models may be incentivized instrumentally to do causal analysis e.g. by using human explanations of causality. However, even given an understanding of formal methods in causal inference, the model may not have the relevant data at hand. Intuitively, I'd expect there usually not to be any deconfounding adjustment set observable in the data[1]. As a weaker assumption, one might hope that causal uncertainty might be modellable from the data. As far as I know, it's generally not possible to rule out the existence of unobserved confounders from observational data, but there might be assumptions relevant to the LM case which allow for estimation of confoundedness.
The strongest, and most safety-relevant, implication claimed is "(3) [models] reason with human concepts. We believe the issues we present here are likely to prevent (3)". The arguments in this post increase my uncertainty on this point, but I still think there are good a priori reasons to be skeptical of this implication. In particular, while we should expect various causal confusions to emerge, it seems likely that these will be orthogonal in some sense, such that as models scale they cancel out and the model converges to causally valid generalizations. If we assume models are doing compression, we can put this another way: causal confusions yield shallow patterns (low compression), and as models scale they do better compression. As compression increases, the number of possible strategies achieving that level of compression decreases, but the true causal structure remains in the set of strategies. Hence, we should expect causal-confusion-based shallow patterns to be discarded. To cash this out in terms of a simple example: even though data directly showing the sun's role in producing the shorts<>ice cream connection is not observed, more and more data about shorts, ice cream, and the sun is being compressed, and in the limit the shorts->ice cream pathway incurs a compression cost which causes that hypothesis to be discarded.
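A minimal simulation of that example (the structural model and probabilities are assumptions for illustration) shows the underlying causal confusion: conditioning on shorts makes ice cream look much more likely, even though intervening on shorts would change nothing, because the sun is the common cause.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumed structure: sun -> shorts, sun -> ice_cream, no shorts -> ice_cream effect.
sun = rng.random(n) < 0.5
shorts = rng.random(n) < np.where(sun, 0.8, 0.2)
ice_cream = rng.random(n) < np.where(sun, 0.7, 0.1)

# Observationally, shorts strongly predicts ice cream ...
print("P(ice cream | shorts)    ~", ice_cream[shorts].mean())   # ~0.58
print("P(ice cream | no shorts) ~", ice_cream[~shorts].mean())  # ~0.22

# ... but under do(shorts), ice cream depends only on the sun, so the
# interventional probability is just the marginal, ~0.40.
print("P(ice cream | do(shorts)) ~", ice_cream.mean())
```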
High uncertainty. One relevant thought experiment is to consider adjustment sets for the unobserved variable Z = IsReddit. Perhaps there exists some subset of the dataset where Z = IsReddit is observable, and the model learns a sub-model which gives calibrated estimates of how likely the remaining text is to be derived from Reddit.
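A rough sketch of what such a sub-model could look like, treated as a standalone classifier (the placeholder texts, labels, and the TF-IDF + logistic regression choice are all assumptions for illustration, not a claim about how an LM would represent this internally):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical data: the subset of the corpus where provenance is recoverable
# (e.g. from surviving URL metadata), plus the rest of the corpus.
labeled_texts = ["edit: thanks for the gold, kind stranger", "the committee voted 7-2 on tuesday"]
is_reddit = [1, 0]
unlabeled_texts = ["text of unknown provenance goes here"]

# Fit a simple classifier on the labeled subset ...
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(labeled_texts, is_reddit)

# ... and estimate P(Z = IsReddit) for the remaining text.
p_reddit = clf.predict_proba(unlabeled_texts)[:, 1]
print(p_reddit)
```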
Here's another way of thinking about this that should be complementary to my other comment:
Let's assume the safety property we want to ensure is
The overlooked point I want to make here is that {S} << {E}. In this case, there are two possibilities: