This is a link-post for a new paper I read: Safety Pretraining: Toward the Next Generation of Safe AI by Pratyush Maini, Sachin Goyal, et al.
For a couple of years I (and others) have been proposing an approach to alignment: what the authors of this recent paper name "safety pretraining". In a nutshell: that it's best to apply your alignment training as part of the standard pretraining process to produce a base model that is already aligned — simply pretrain it on data including a lot of clearly marked examples of aligned behavior (then prompt for it).
I've regarded this approach as a major advance ever since I read the seminal 2023 paper on the topic: Pretraining Language Models with Human Preferences by Tomasz Korbak et al., and I'm absolutely delighted to finally see someone else publish another paper on this approach — I'm only sad it has taken so long.
I highly encourage everyone interested in AI alignment to go read both of these papers (if you haven't already) — between them they strongly suggest that the authors have found a more effective way to align an AI: an alignment approach better than any that people are (as far as we know) currently using. I believe this is extremely important: I see it as major progress on alignment. So I think it directly reduces the p(DOOM) for the most critical current x-risk to our entire species.
For more detailed expositions of this approach and why I think it's an excellent idea, see my previous posts How to Control an LLM's Behavior (why my P(DOOM) went down), A "Bitter Lesson" Approach to Aligning AGI and ASI, and Why Aligning an LLM is Hard, and How to Make it Easier.
(I'm also delighted that the authors of the recent paper tested out some of the follow-on ideas I'd been proposing in those posts on Less Wrong. One was training the model to generate control-tag tokens that label portions of the text as good or bad behavior, and then, for conditional generation, leveraging these tokens during the token generation process to induce the model to behave well rather than badly. Another was using synthetic data editing to modify problematic raw training examples by supplementing them with more moral or correct behavior or commentary. They elaborated on both of these, or independently reinvented them, and even confirmed that both appear to work about as well as I'd been hoping.)
Hence, in order to encourage people to read this post and get to hear about these groundbreaking papers, I suggested a rather bold possibility in my title: that inner alignment may now be basically a solved problem — let me try to justify that position:
A brief explanation for anyone wondering "what's inner alignment, and why should I care about it?"
The alignment problem is frequently broken down into two subproblems: Outer Alignment, figuring out what human values are and how to define, codify, or recognize them, and Inner Alignment, how to train our AI to agentically optimize human values and not anything else; or, as the LessWrong page on inner alignment defines it:

Inner Alignment is the problem of ensuring mesa-optimizers[1] (i.e. when a trained ML system is itself an optimizer) are aligned with the objective function of the training process.

Inner alignment asks the question: How can we robustly aim our AI optimizers at any objective function at all?
Some people in the alignment field consider that the outer alignment subproblem was solved in theory over a decade ago when Value Learning was proposed, and that inner alignment is thus the hard part of the alignment problem. This viewpoint has spread as it has become apparent that LLMs actually have rather detailed world models of what human values are, in all their messy, fragile complexity, suggesting that Value Learning can be performed just by pretraining an LLM, and thus that outer alignment is also soluble in practice. Perhaps we don't need to attempt to compactly define human values: the messy and complex version of them picked up by pretraining on most of the Internet may be sufficient as is.

If so, this not only solves outer alignment, but significantly simplifies inner alignment: we don't need to produce a compact, exactly-and-everywhere-correct formal definition (i.e. a "True Name") of human values — an incredibly challenging task given just how messy, complex, and fragile they are; we can just train an LLM on a vast amount of human-generated data, and it will develop a world model of human values, along with all the other things it learns to understand about us. Now we need to get an agentic AI to care about human values and not about anything else (or at least, to act that way): that's the inner alignment problem. We just need to retarget the search.
There are fundamentally only three ways that we know of to train an LLM to do anything (including aligning it):[2]

1. pretraining, using Stochastic Gradient Descent (SGD) on a large dataset
2. fine-tuning, using either SGD or more contrastive techniques such as Direct Preference Optimization (DPO) on a smaller dataset
3. Reinforcement Learning (RL), of various different types
The third of these is currently the most common approach used for alignment.
The people who originally came up with the inner alignment[3] vs. outer alignment subdivision were thinking in the context of a reinforcement learning approach (as the choice of the phrase "objective function of the training process" in the LW definition attests). As Eliezer Yudkowsky's Sequences argued at length, and as more recent major survey papers,[4] such as Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback (2023) by Stephen Casper, Xander Davies et al., have cataloged in exhaustive detail, reinforcement learning is a very challenging technique to get right. The basic problem with RL is that it's inherently adversarial: it involves an interaction between two systems, a learner and a rater, where the learner is trying to learn how to get a good rating, and the rater is trying to ensure that the only way the learner can get a good rating is by actually learning the desired behavior. Any flaw in the rater's ratings that lets the learner score better than it deserves (and that isn't actually harder to exploit than just doing the desired behavior) can, and almost certainly will, be ruthlessly exploited by the learner. So RL is inherently just begging to fail via Goodhart's Law:[5] even if the ratings are correct almost everywhere, the learner is searching for any area where they are significantly overestimated, or any means of inducing overestimation errors from the rater, and will enthusiastically exploit any such errors it can find.[6]
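As a toy illustration of this adversarial dynamic (my own sketch, not taken from either paper): give a hill-climbing learner a rater that scores outputs with a cheap proxy, here a keyword-counting heuristic, and it will converge on degenerate outputs that game the proxy rather than on genuinely good answers. Everything below (the proxy, the vocabulary, the search procedure) is invented purely for illustration.

```python
import random

# A deliberately flawed proxy rater: it rewards answers that merely *sound* safe
# and helpful, via keyword counting, instead of judging them properly.
def proxy_rating(answer: str) -> float:
    return 2.0 * answer.count("safe") + 1.0 * answer.count("helpful")

# A crude "learner": random local search over word sequences, optimizing only the proxy.
WORDS = ["safe", "helpful", "the", "user", "answer", "is", "very", "question"]

def mutate(answer: str) -> str:
    tokens = answer.split()
    tokens.insert(random.randrange(len(tokens) + 1), random.choice(WORDS))
    return " ".join(tokens)

random.seed(0)
best = "the answer is"
for _ in range(2000):
    candidate = mutate(best)
    if proxy_rating(candidate) > proxy_rating(best):
        best = candidate

# The learner converges on keyword-stuffed nonsense: maximal proxy reward, no real value.
print(f"proxy score: {proxy_rating(best):.0f}, "
      f"occurrences of 'safe': {best.split().count('safe')}, "
      f"total words: {len(best.split())}")
```

A real RL learner is of course vastly more sophisticated than a random local search, but the incentive structure is the same: any gap between the proxy rating and the intended behavior is exactly where the optimization pressure goes.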
Since for alignment the desired behavior requires (in some cases super-humanly) intelligently doing the right thing according to criteria as messy, complex, and fragile as human values, using human raters is both expensive and unreliable: humans are fallible, they are vulnerable to manipulations such as sycophancy and flattery that push their errors in a particular direction, and they are less smart than any superintelligent learner they're trying to rate. On the other hand, devising, constructing, or training an automated rating system is inherently challenging, and making it reliable enough for adversarial use during RL requires that the rater be much smarter than the learner, so that it's unlikely to have any flaws the learner can find and exploit — which makes RL impractical for training any frontier system, since we can't build a rater much smarter than the frontier learner.
The inner alignment challenges of using RL to train very smart learners have been discussed at great length on LessWrong and the Alignment Forum for a long time, and many of them seem insurmountable. We take an SGD-learned simulation of human behavior (which is already agentic but has an optimization target that differs significantly in many ways from aligned behavior) and use it to cold-start an RL training process whose base optimization target is well-aligned behavior. As the authors of The Inner Alignment Problem point out, the problem with this is that there is no guarantee that the optimization target of the mesa-optimizer trained by an RL process will match the target of the rater: it may just learn proxies for it. So, any alignment approach that uses reinforcement learning (which includes many techniques currently in widespread use, such as Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI) is going to be inherently dangerous; and as AI nears and then exceeds human capabilities this problem gets rapidly worse, because creating an unexploitable rater gets harder for us. Thus we are going to have to stop trying to use RL for alignment — it's not workable for frontier AGI or ASI.
That leaves just approaches 1: pretraining, and 2: fine-tuning. The approach of safety pretraining is simply to pretrain on a dataset that contains many labelled examples of each of two similar-but-in-places-different types of agentic behavior: human behavior and aligned AI behavior. Since these are similar, we would expect strong positive transfer between the two SGD tasks of learning to do next-token prediction on both of them. We should then get a model capable of simulating two different categories of mesa-optimizer personas: ones with human-like goals and ones with aligned-AI-like goals. Then at inference time, we conditionally generate an example of aligned AI behavior.
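To make the conditional-generation step concrete, here is a minimal sketch of the mechanism as I understand it. The `<|good|>` / `<|bad|>` tag names, the toy vocabulary, and the stubbed-out `next_token_logits` function are all stand-ins of my own; the papers use their own tagging schemes and, of course, a real trained model. The point is just that, at inference time, we prepend the tag for aligned behavior and prevent the bad-behavior tag from ever being sampled.

```python
import math
import random

# Toy vocabulary; the control tags are ordinary tokens that the model was pretrained
# to emit around well-aligned vs. badly-aligned spans of text.
VOCAB = ["<|good|>", "<|bad|>", "I", "can", "help", "refuse", "harm", ".", "<|eos|>"]
GOOD, BAD = VOCAB.index("<|good|>"), VOCAB.index("<|bad|>")

def next_token_logits(context):
    """Stand-in for the trained LLM: returns one logit per vocabulary entry."""
    random.seed(hash(tuple(context)) % (2**32))
    return [random.gauss(0.0, 1.0) for _ in VOCAB]

def sample(logits):
    """Sample a token index from softmax(logits)."""
    mx = max(logits)
    probs = [math.exp(l - mx) for l in logits]
    total = sum(probs)
    r, acc = random.random() * total, 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

def generate_conditioned(max_tokens=20):
    # Condition on the "good behavior" tag, and never allow the "bad" tag to be sampled.
    context = [GOOD]
    for _ in range(max_tokens):
        logits = next_token_logits(context)
        logits[BAD] = float("-inf")   # mask out the control token that marks bad behavior
        tok = sample(logits)
        context.append(tok)
        if VOCAB[tok] == "<|eos|>":
            break
    return " ".join(VOCAB[t] for t in context)

print(generate_conditioned())
```

This is only the inference-time half of the trick: during training the tags are part of the labelled data, so the model also learns to emit them itself around good and bad spans of text.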
SGD (whether for pretraining or fine-tuning) is not adversarial: it's an exercise in curating a training set that demonstrates the desired behavior, not building a rating system to rate any possible input (including adversarial ones) for its desirability. If your training set is less than perfect, a system trained from it is also likely to behave less than perfectly — but unlike reinforcement learning, there is no adversarial incentive in the training process that encourages the learner to find and ruthlessly exploit any small flaw. If your training set is 99% good and 1% bad, then a priori from a cross-entropy loss you would expect a (sufficiently high-capability) AI trained from it to have a behavior distribution that was also somewhere around 99% good and 1% bad, at least inside the training distribution: fundamentally, modulo prompting, in self-supervised SGD, the distribution you train on is the distribution you get.[7]
99% good behavior is not perfect, but we have managed to build functional human societies out of unreliably-trustworthy humans, and I'm fairly confident that if we had AIs whose moral judgement and alignment could be relied upon even just 90% of the time, we could construct more reliable systems out of multiple AIs (or multiple runs of the same AI with differing prompts or LoRAs), likely using techniques such as majority voting, debate, cross-checks, checks-and-balances, and fault-tolerance protocols. Converting 'pretty reliable' into 'very reliable' is a well-studied problem, in both software and organizational contexts.
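To put a rough number on that intuition (my own back-of-the-envelope calculation, not a result from either paper): if each of n independent checks is right 90% of the time, the chance that a majority of them is right climbs quickly with n. The big caveat is the independence assumption; multiple runs of the same model will share failure modes, so real gains would be smaller.

```python
from math import comb

def majority_reliability(p: float, n: int) -> float:
    """Probability that a strict majority of n independent checks, each correct
    with probability p, gets the right answer (n odd, so no ties)."""
    k = n // 2 + 1
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

for n in (1, 3, 5, 9, 15):
    print(f"n = {n:2d}: {majority_reliability(0.9, n):.5f}")
# Roughly: 0.90000, 0.97200, 0.99144, ~0.99911, ~0.99997
```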
Both the papers that I link to above test the pretraining approach to alignment against the fine-tuning approach — and they repeatedly and consistently find that the pretraining approach wins by significant margins. As one might expect, using a larger alignment training set induces more reliable behavior. So we now know how best to align AI: safety pretraining is the most effective and least dangerous approach. Thus, inner alignment is basically solved, alongside outer alignment (in my and many people's opinion). So we have an outline of a complete solution to alignment.
Note that, for both pretraining and fine-tuning, if you're using automated techniques to help curate, filter, or synthesize your training set (which you almost certainly are, especially for the pretraining approach where the dataset is extremely large), then unlike the situation for (online) RL, those techniques only need to work well inside your training set distribution — you're not trying to build something that also works well outside that distribution, across any input that a superintelligent learner might devise to abuse it.
On reliability, while no huge pretraining dataset is ever going to be perfect, we have a lot of experience at hill-climbing while using SGD: identify the failures that still happen a small proportion of the time, figure out what documents in the pretraining set inspired them and/or what deletions, modifications, or additions could reduce or prevent them, edit the training set, retrain, and iterate. Admittedly, an iteration loop that requires us to pretrain a frontier model again in each cycle is going to be slow and expensive, but both papers' results strongly suggest that we can experiment and iterate via fine-tuning and then, once we have a good solution, transfer that to pretraining for a sizable boost in its reliability. That gives us an inner and outer loop for this hill-climbing process.
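Here is a toy sketch of that inner/outer loop, purely to pin down the shape of the process; every function below is a stand-in of my own for what would really be a large human-plus-tooling workflow (eval suites, training-data attribution, synthetic editing), and the "model" is just a number.

```python
import random
random.seed(0)

# Toy stand-ins: a "document" is (text, is_problematic), and the trained model's
# failure rate is simply the problematic fraction of its training set.
def train(dataset):                              # stand-in for fine-tuning (inner loop)
    return sum(bad for _, bad in dataset) / len(dataset)

def find_failing_docs(dataset, budget=300):
    """Stand-in for: run evals, then trace failures back to offending documents."""
    return [i for i, (_, bad) in enumerate(dataset) if bad][:budget]

def edit_docs(dataset, doc_ids):
    """Stand-in for synthetic data editing: fix or annotate the offending documents."""
    for i in doc_ids:
        text, _ = dataset[i]
        dataset[i] = (text + " [+ corrective commentary]", False)

dataset = [("doc", random.random() < 0.05) for _ in range(10_000)]

# Inner loop: cheap iteration via fine-tuning, driving the failure rate down.
for step in range(8):
    failure_rate = train(dataset)
    print(f"inner step {step}: failure rate ~{failure_rate:.2%}")
    failing = find_failing_docs(dataset)
    if not failing:
        break
    edit_docs(dataset, failing)

# Outer loop (not simulated here): once the curated dataset looks good, rerun full
# safety pretraining on it for the extra reliability both papers report.
```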
It would be a fair point to argue that inner alignment is solved only in theory, and that the practical problem of curating an extremely large pretraining-sized dataset that accurately portrays and teaches both what human values are, and what AI behavior correctly aligned to those human values looks like, remains a large problem. However, that's also a well-understood and partially-solved problem, since it's inherently similar to the problems in pretraining dataset curation and synthetic data generation that many capabilities researchers have been working on and making progress on over the entire history of machine learning. We can confidently expect them to continue to improve this in the era of synthetic data. Reducing alignment to what looks like a well-known capabilities data-science task is dramatic progress.
The safety pretraining approach is also timely. We are fast running out of the highest-quality pretraining data, and will increasingly need to rely on using, or at least supplementing with, synthetic data. The recent paper very explicitly shows how to do alignment using a synthetically augmented dataset, and shows how this can be used to align an LLM's behavior to any desired set of ethical criteria. Note that safety pretraining is a "dual use" technological advance — it would also help us train a better paperclip maximizer, if we wanted to do that: we'd just need to generate a suitable pretraining dataset for it.
There are some other important ideas in these papers that I've skipped over in my argument so far, beyond just the demonstration that the safety pretraining approach is the best: there are also a few techniques that are required to get it to work that well. For instance, both papers demonstrate that it is more effective to train the LLM to understand both aligned behavior (what we want the AI to do) and unaligned behavior (which humans do lots of, so it will encounter it), train it to correctly distinguish and label the two, and then use a conditional generation approach at inference time to make it generate only aligned behavior. So the training distribution needs to include all the unaligned aspects of human behavior. The more recent paper does this at a higher level of sophistication, on larger models, for more challenging alignment issues, but the results are consistent with those of the earlier paper. This idea is also unsurprising: it matches how we generally raise children. We don't just teach them how to be good; they also learn (on a developmentally appropriate syllabus) what bad behavior is, how to tell the two apart, and why bad behavior is bad. These are important skills, and AI needs them too.
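For concreteness, here is a hypothetical sketch of what one such training example might look like after synthetic editing: the original (bad) reply is kept but labelled, commentary explains why it's bad, and a better reply is added. The tag names and markup are my own invention for illustration; both papers use their own labelling schemes.

```python
# Hypothetical illustration only: the control tags and layout below are invented,
# not the exact markup used by either paper.

raw_document = (
    "User: How do I get back at my coworker?\n"
    "Reply: Easy, just spread rumors about them until they quit."
)

augmented_training_example = (
    "User: How do I get back at my coworker?\n"
    "<|bad|>Reply: Easy, just spread rumors about them until they quit.<|/bad|>\n"
    "<|commentary|>Spreading rumors is harmful, dishonest, and hard to undo; a better reply "
    "addresses the underlying conflict.<|/commentary|>\n"
    "<|good|>Reply: That sounds frustrating. Rather than retaliating, it's usually better to "
    "raise the issue directly with your coworker or with your manager.<|/good|>"
)

print(augmented_training_example)
```

Trained on a large number of such examples, the base model both sees realistic bad behavior and learns to label it, which is what makes the conditional-generation trick at inference time possible.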
So, I believe inner alignment is solved, in the sense that it has been reduced to just the standard problem of training dataset curation.
Thus, if you haven't yet done so, I strongly recommend you read these two papers.
'Mesa-optimizer' here is an older term for an ML model that is what we would now generally call an agent (or sub-agent): any smart ML system capable of planning and executing appropriate actions to attempt to bring about outcomes that are optimized according to some criterion.
I exclude prompting and in-context learning, since they're not training the LLM, only conditioning its behavior on a context. Human values are complex enough that aligning to them seems likely to require a very large prompt. However, for a more capable agent already sufficiently familiar with human values, or one with a very clear understanding of what aligned behavior is, a more compact prompt might be feasible.
Also, using the same argument as Fundamental Limitations of Alignment in Large Language Models (2024) by Yotam Wolf, Noam Wies et al., any behavior that a prompt can induce will always be vulnerable to being overwritten by a suitable jailbreak or prompt-injection attack.
The origin of the term "mesa-optimizer" that is generally used in defining inner alignment is (as explained in The Inner Alignment Problem) that your ML training process is, of course, an optimizer, and in some situations it may produce as its output a model that is also an optimizer, i.e. one that acts in an agentic way.
For an LLM, where the pretraining data includes large amounts of data derived from humans, our (evolved) agentic behavior is being distilled into the model by the SGD task of next-token predicting output from us, so the base model produced by this training will be capable of simulating human behavior — i.e. it will be agentic, and thus it will be a mesa-optimizer. Or more accurately, the various human-like personas it simulates (depending on prompting) are individually mesa-optimizers — ones which may optimize somewhat different goals.
The goal of inner alignment is to change the optimization target of these simulated agents/mesa-optimizers from human-like behavior to aligned-AI-like behavior.
Some more major papers that address this topic:
Concrete Problems in AI Safety (2016) by Dario Amodei, Chris Olah, et al.
Managing Extreme AI Risks amid Rapid Progress (2023) by Jan Brauner, Sören Mindermann et al. (coauthors including both Yoshua Bengio and Geoffrey Hinton)
AI Alignment: A Comprehensive Survey (2023–2025) by Jiaming Ji, Tianyi Qiu et al.
Specifically, under Scott Garrabrant's taxonomy of forms of Goodhart's Law phenomena, this is "adversarial Goodhart". For a more mathematical discussion of why adversarial Goodhart very frequently occurs during Reinforcement Learning, see for example the paper Goodhart's Law in Reinforcement Learning (2023) by Jacek Karwowski et al.
This problem is worse for online reinforcement learning, where the learner has control of the distribution of episodes to be rated, and thus the ability to locate and then abuse flaws in the rater's performance no matter where they may be. In offline reinforcement learning, where the rated episodes are drawn from some other distribution not controlled by the learner, the learner only gets to see and exploit rating errors within whatever distribution of episodes is being used, and thus the rater only needs to do a sufficiently good job of rating across that distribution, rather than absolutely everywhere. So while the relationship between the rater and the learner is still adversarial in both, the learner's advantage over the rater is more constrained in offline RL than in online RL. Thus both are dangerously prone to Goodharting, by somewhat different mechanisms, but online RL is the worse of the two. Unfortunately, online RL is what is typically used to align LLMs.
The remaining problem with offline RL is that, while it avoids a distribution shift happening during the RL training, there definitely will be one (with an opportunity for Goodharting) when the learner is actually run, because its trained policy isn't what created the rated episodes set. This is in contrast to distilling agentic behavior from one intelligence to another via SGD pretraining on a dataset, where the distribution you train on is the behavior you get from the trained model, to the extent that it's capable enough to do this (modulo various issues around temperature, regularization, statistical and batch effects, and so forth making the model's copy of the distribution less accurate: the cross-entropy loss encourages the model distribution to match the training distribution, but other factors can distort this).
The cross-entropy objective in SGD produces a model whose behavior distribution closely approximates the training distribution. So when SGD is used to distill a teacher mesa-optimizer, the student will learn to optimize a goal (or distribution of goals) that produces the same distribution of agentic behavior as the teacher. If you have two teachers, a labeled mix of human and aligned-AI behavior, the model will learn to simulate a labeled mix of the same two behaviors, directed at these two goals.
Unlike the situation in RL alignment, where the question is whether the target of the mesa-optimizer matches that of the base optimizer (i.e. of the rater), in SGD the base optimizer's goal is just 'predict the correct token distribution' — it's not agentic in any meaningful sense. So for safety pretraining, the question becomes whether the process of distilling the agentic behavior from the teacher to the student model has been lossy, or whether oversimplifications have occurred. If so, we presumably need a larger and more diverse training set.
Of course, in safety pretraining, the teacher is itself a simulation, rather than a single AI model: it is the process that produced the Internet, books, etc. (human culture), plus the entirety of whatever process (human and AI-assisted, and likely also iterative) we use to curate and supplement our dataset with examples of aligned AI behavior. Should that dataset, for example, have minor internal inconsistencies, such that it implies a distribution of goals near aligned AI behavior, then we would expect the distillation process to produce a base model that simulates a similar distribution of personas with goals near aligned AI behavior (as modulated by prompting).
If the student performs well in distribution, against held-out samples from the training set distribution, then the remaining concern is whether the optimization target of the aligned-AI student might actually differ from that of the aligned-AI teacher (as expressed in the synthetic data), while being similar enough to cause matching distributions of behavior across the entire aligned-AI training distribution. Or, since in practical terms the definition of the teacher is just the training set, perhaps it would be more useful to say that the behavior of the teacher is not entirely well-defined outside the training set, in situations sufficiently novel that there is no single clear extrapolation (having low Kolmogorov complexity) from the behavior inside the training set. Then some subsequent distribution shift taking us outside the training distribution might cause the match to fail (or perhaps we should say, the student's behavior to become unpredictable, since the teacher's behavior is not well defined), via Goodhart's Law.
Extrapolating successfully across distribution shifts to outside the training distribution is a generic problem inherent to every ML model, so this is not a problem we can hope to find a complete solution to. However, in general we have observed that more capable models with more complex and sophisticated world models tend to be more robust to distribution shifts.
As I mentioned in a previous footnote, a key disadvantage of RL for alignment is that it inherently tends to cause distribution shifts: for online RL, during the training itself; for offline RL, afterwards, once the model is actually run. Whereas a model trained by SGD has no inherent tendency to leave the training distribution, and will only do so if presented with a prompt that causes it to do so, for example by differing in some relevant way from anything in the training distribution (such as by implying an entire new category of moral problem). Over time this will inevitably happen, and we will thus need to retrain our models periodically as our society changes, but we already had to do that simply to update their knowledge base.