Interpretability Will Not Reliably Find Deceptive AI

Neel Nanda

Disclaimer: Post written in a personal capacity. These are personal opinions and do not in any way represent my employer's views

TL;DR:

I do not think we will produce high reliability methods to evaluate or monitor the safety of superintelligent systems via current research paradigms, with interpretability or otherwise.
Interpretability still seems a valuable tool and remains worth investing in, as it will hopefully increase the reliability we can achieve.
However, interpretability should be viewed as part of an overall portfolio of defences: a layer in a defence-in-depth strategy
It is not the one thing that will save us, and it still won’t be enough for high reliability.

EDIT: This post was originally motivated by refuting the claim "interpretability is the only reliable path forward for detecting deception in advanced AI", but on closer reading this is a stronger claim than Dario's post explicitly makes. I stand by the actual contents of the post, but have edited the framing a bit, and also emphasised that I used to hold the position I am now critiquing, apologies for the mistake

Introduction

There’s a common argument made in AI safety discussions: it is important to work on interpretability research because it is a realistic path to high reliability safeguards on powerful systems - e.g. as argued in Dario Amodei’s recent “The Urgency of Interpretability”.^[1] Sometimes an even stronger argument is made, that interpretability is the only realistic path to highly reliable safeguards - I used to believe both of these arguments myself. I now disagree with these arguments.

The conceptual reasoning is simple and compelling: a sufficiently sophisticated deceptive AI can say whatever we want to hear, perfectly mimicking aligned behavior externally. But faking its internal cognitive processes – its "thoughts" – seems much harder. Therefore, goes the argument, we must rely on interpretability to truly know if an AI is aligned.

I am concerned this line of reasoning represents an isolated demand for rigor. It correctly identifies the deep flaws in relying solely on external behavior (black-box methods) but implicitly assumes that interpretability doesn't suffer from equally fundamental problems. There are many deep issues in interpretability that prevent very confident conclusions, even if we assume models cannot deliberately obfuscate their thoughts, e.g. superposition and the inherent error in our best tools and techniques. The challenges of interpretability do not seem qualitatively easier to solve than the big issues in black box tests, especially with more creative black-box tools like monitoring or editing the system’s chain of thought.^[2]

Should we give up on interpretability? No! I still think it has the potential to add a lot of value, and we will have better safeguards with interpretability as part of our portfolio. Even if it adds no value for making superintelligence safer^[3], if it can add value for pre-superintelligence transformative systems that would be enough to justify investment. I just think that we should be more pragmatic about interpretability’s likely impact, and accept that while we can generally improve our safeguards we will likely not reach high reliability.

High Reliability Seems Unattainable

Based on the current state and foreseeable trajectory of the field without major paradigm shifts, I think that neither interpretability nor black box methods offer a high reliability^[4] path to safeguards for superintelligence, in terms of evaluation or monitoring. This is due to fairly fundamental limitations of both methods, unless there are substantial breakthroughs, e.g. via using pre-superintelligence systems. This also stems from a general prior that being highly confident in any complex property of a complex system in a range of unpredictable situations is fairly implausible.

However, I am pretty pessimistic in general about reliable safeguards against superintelligence with any methods, given how exceptionally hard it is to reason about how a system far smarter than me could evade my plans. As I see it we must either not create superintelligence, rely on pre-superintelligent automated researchers to find better methods, or deploy without fully reliable safeguards and roll the dice, and do as much as we can now to improve our odds.

This doesn't mean we should give up! It means we need a pragmatic perspective. We should aim to build the best possible monitoring and evaluation portfolio we can, using all available tools, while accepting that high reliability might be out of reach. Interpretability can add a valuable source of de-correlated signal, or augment black box methods. The goal shifts from achieving near-certainty to maximizing the chances of catching misalignment, making deception harder and riskier for the AI, even if we can't guarantee detection.

Further, methods that add significant safety to pre-superintelligent transformative systems still add significant value even if they don’t scale to superintelligence - one of the key insights behind the AI control agenda. Early transformative systems seem likely to be far easier to evaluate and control, but may still be very dangerous if unchecked, and extremely useful if under control. And when evaluating methods according to how much they help with early transformative systems, interpretability is just one among several reasonable methods (and probably not my first choice - things like chain of thought monitoring seem like lower hanging fruit)

Why Won’t Interpretability be Reliable?

Caveat: The following is my inside view for the deep reasons why interpretability will never be truly reliable for human science, no matter how much the field progresses. This is a very strong statement, so I'm not that confident here - while it’s obvious that current techniques are unreliable, this is a question of forecasting future research progress, a highly questionable endeavour. It’s very hard to say what might be possible with a paradigm shift in how LLMs are trained, or how we do interpretability.

More importantly, I do not feel able to reason about the effects of lots of AGIs doing AI research. "Thousands of years of mech interp geniuses in a data center" is a crazy anomaly that I can't reason well about, and seems reasonably likely to happen before superintelligence, so I want to hedge. Not only could the research lead to unexpected breakthroughs, it opens up all kinds of weird strategies leveraging the existence of the AGIs like inspecting every SAE latent in detail - though I'd imagine scalable oversight/control schemes using the AGIs should be higher priority than interp. I also think we should be fairly uncertain about what safety research directions the automated researchers will find to be most promising.

That aside, reasons I won't be comfortable relying on interpretability:

We miss things: We're still figuring out the basics. Concepts like superposition mean important features can be smeared across many neurons in non-obvious ways. Our tools like SAEs or transcoders have error and often miss things. We can misinterpret the features found, or only explain a fraction of the model's behavior.
It’s hard to measure progress: We don’t have a ground truth for what’s actually going on, and must inevitably use a range of proxies of varying quality. This means it’s hard to ever be confident you really understand what’s happening.
The Long Tail Problem: Models seem to have both some clean, interpretable structure and a massive, messy "long tail" of weird heuristics, special case handlers, and half-learned, noisy behaviours. I’d personally guess that important capabilities, including deception, are in the clean part, but can’t say for sure!
How to prove a negative: We can find evidence for or against a hypothesis, but rigorously proving the absence of deception circuits seems incredibly hard. How do you know you didn't just miss it? How much of the model do you need to understand? 90%? 99%? 99.99%? The required level of certainty for high-stakes deployment feels far beyond our current capabilities.
- I’m more optimistic about interpretability as a way to potentially demonstrate misalignment, but failure to do so tells us little.
Scalability Questions: Most of our best results are on toy models or yesterday's LLMs. I think that scaling is a tractable problem, but likely involves a bunch of LLM labour assisting, another source of error and uncertainty.
Potential for Obfuscation: While maybe harder than faking outputs, a smart AI could potentially learn to structure its internal computations to be opaque to our specific interpretability tools.

The Potential of Black-Box Methods

Black-box methods, i.e. things that engage with the model’s inputs and outputs, have clear issues: models can already tell when they’re being evaluated, or realise when they need to instrumentally fake obedience. But I think there’s also a lot of clever tricks you can do to strengthen them, and they might go a long way. This is more a question about AI control than my areas of expertise, but some thoughts on directions I find promising:

Reading the chain of thought (with another LLM) and looking for sketchy intent^[5]
Editing the model’s chain of thought so it e.g. thinks it concluded that it’s not in an evaluation
Asking questions to many copies of the same model and look for inconsistencies while they can’t coordinate.
Finetuning the model to e.g. be more truthful or forthcoming, or to insert misleading beliefs and trick it.
Creating honeypots that try to trick the model into revealing its intent

The Role of Interpretability

So, where does interpretability fit into this pragmatic portfolio? I see several key roles, viewing it not as a standalone solution, but as a valuable component.

Firstly, safeguards can be categorised as follows:

Evaluation: We take a system and perform extensive tests on it for undesired properties, like a tendency towards misalignment or sandbagging
Monitoring: When we deploy a system we are always running some checks at inference, which try to catch and mitigate bad behaviour
1. Importantly, this needs to be cheap to be run in production, while evaluations can be more expensive and slow

How can interp help?

A Layer of Swiss Cheese: I think the right monitoring schemes have several layers of defence (a la the swiss cheese model), with different cost to accuracy trade-offs.
- Interpretability techniques like probes for known dangerous features are extremely cheap, and surprisingly effective, and can serve as a good first layer of defence, which call more expensive tests (e.g. an LLM reading the chain of thought) if they fire.
- Interpretability techniques like SAEs or cross-layer transcoders may have the potential to be more expensive but accurate ways to audit what’s going on and look for anything unexpected.
- Even if interpretability is not more accurate, so long as it has some signal and is de-correlated, then it adds value, even if it’ll miss many things.
Enhancing Black-Box Evaluations: Interpretability might make our external checks more effective. For example:
- Could we use steering vectors identified via interpretability to manipulate whether the model thinks it’s in an eval? Maybe we can test it more effectively if we can trick it into thinking it's not in a honeypot or evaluation setting.
- Can we use interpretability to better elicit information from token forcing? We can make the model say “my secret plan is”, and even if it tries to internally suppress awareness or obfuscate the output perhaps interpretability can prevent that?
- Interpretability may provide a sanity check that our black box methods are not missing something, e.g. whether a model is passing an eval only because it knows it’s in one and is deliberately underperforming.
Debugging mysterious behaviour: One particularly interesting form of evaluation (or offline monitoring) is taking particularly weird or unexpected examples of model behaviour, e.g. it seeming to put vulnerabilities in critical code, and trying to debug what’s going on and see if it’s real alignment or not. I think interpretability methods may prove very good at raising unexpected hypotheses, and piecing together what’s happening (e.g. this proof of concept)
- I expect these kinds of potential warning shots to happen, and anything that can add more signal here would be great.
- Even if we don’t trust the interpretability method fully, the hypotheses generated can be verified by other means.

These applications aren't guaranteed to work or be ungameable, but they represent concrete ways internal understanding could augment external validation, creating a more robust overall system than either alone.

Conclusion

Let's stop framing interpretability as the one true path to reliably detecting AI deception, a view often born from an isolated demand for rigor against flawed black-box methods. Neither approach offers high reliability against superintelligence without fundamental breakthroughs, but flawed methods can still be useful. For the powerful pre-SI systems that concern us now, let's focus on building the best pragmatic portfolio of imperfect defenses we can. Interpretability is a useful part of that portfolio – offering unique angles of analysis and potential synergies with other methods – but it's one tool among many, not a substitute for a comprehensive, defense-in-depth strategy. We need to leverage every imperfect signal we can get, from both inside and outside the box.

Thanks to my co-author Gemini 2.5 Pro

While this post was inspired by Dario's post and ensuing discussion, this is a common belief I've encountered, and this post is not intended to call out him specifically! ↩︎
More specifically, I basically think the only way we might get high reliability is by using pre-superintelligent systems that are sufficiently aligned/controlled to either do a ton of automated research and find some breakthroughs, or by using them as a core part of a monitoring or scalable oversight scheme. But the human research trajectory is hard. ↩︎
Defining “safety” is kinda messy, for the purposes of this post you can interpret it as evaluating ‘is the model thinking misaligned thoughts’ or monitoring for ‘is the model currently knowingly taking high-stakes actions against what we want’ ↩︎
I’m not sure what the best precise definition of highly reliable is here, maybe 99.9% chance of stopping a catastrophic outcome? It’s not key to my argument. ↩︎
Obviously, Chain of Thought can fail to reflect the model’s true reasoning. But I think that for tasks at the limits of its ability it likely needs to use the chain of thought for intermediate steps in the computation, creating an incentive to store key info in there. Of course, there are many ways this can fail. ↩︎

^{^}

or maybe they'll just update GPT-4o to be a total sycophant and ship it to hundreds of millions people. Honestly hard to say nowadays.

[-]Buck Shlegeris2mo4951

I agree with most of this, thanks for saying it. I've been dismayed for the last several years by continuing unreasonable levels of emphasis on interpretability techniques as a strategy for safety.

My main disagreement is that you place more emphasis than I would on chain-of-thought monitoring compared to other AI control methods. CoT monitoring seems like a great control method when available, but I think it's reasonably likely that it won't work on the AIs that we'd want to control, because those models will have access to some kind of "neuralese" that allows them to reason in ways we can't observe. This is why I mostly focus on control measures other than CoT monitoring. (All of our control research to date has basically been assuming that CoT monitoring is unavailable as a strategy.)

Another note is that you might have other goals than finding deceptive AI, e.g. you might want to be able to convince other people that you've found deceptive AI (which I'm somewhat skeptical you'll be able to do with non-behavioral methods), or you might want to be able to safely deploy known-scheming models. Interp doesn't obviously help much with those, which makes it a worse target for research effort.

[-]David Manheim2mo10

CoT monitoring seems like a great control method when available

As I posted in a top level comment, I'm not convinced that even success would be a good outcome. I think that if we get this working 99.999% reliably. we still end up delegating parts of the oversight in ways that have other alignment failure modes, such as via hyper-introspection.

[-]Logan Riggs Smith2mo64

I had this position since 2022, but this past year I've been very surprised and impressed by just how good black box methods can be e.g. the control agenda, Owain Evan's work, Anthropic's (& other's I'm probably forgetting).

How to prove a negative: We can find evidence for or against a hypothesis, but rigorously proving the absence of deception circuits seems incredibly hard. How do you know you didn't just miss it? How much of the model do you need to understand? 90%? 99%? 99.99%?

If you understand 99.9% of the model, then you can just run your understanding, leaving out the possible deception circuit in the 0.1% you couldn't capture. Ideally this 99.9% is useful enough to automate research (or you use the 99.9% model as your trusted overseer as you try to bootstrap interp research to understand more percentage points of the model).

[-]Buck Shlegeris2mo*139

I agree in principle, but as far as I know, no interp explanation that has been produced explains more like 20-50% of the (tiny) parts of the model it's trying to explain (e.g. see the causal scrubbing results, or our discussion with Neel). See that dialogue with Neel for more on the question of how much of the model we understand.

[-]Neel Nanda2mo43

I disagree re the way we currently use understand - eg I think that SAE reconstructions have the potential to smuggle in lots of things via EG the exact values of the continuous activations, latents that don't quite mean what we think, etc.

It's plausible that a future and stricter definition of understand fixes this though, in which case I might agree? But I would still be concerned that 99.9% understanding involves a really long tale of heuristics and I don't know what may emerge from combining many things that individually make sense. And I probably put >0.1% that a super intelligence could adversarially smuggle things we don't like into a system we don't think we understand.

Anyway, all that pedantry aside, my actual concern is tractability. If addressed, this seems plausibly helpful!

[-]Charbel-Raphael Segerie1mo*20

Thanks a lot for writing this, this is an important consideration, and it would be sweet if Anthropic updated accordingly.

Some remarks:

I'm still not convinced that Deceptive AI following scheming is the main risk compared to other risks (gradual disempowerment, concentration of power & value Lock in, a nice list of other risks from John).
"Should we give up on interpretability? No!" - I think this is at least a case for reducing the focus a bit, and more diversification of approaches
On the theories of impacts suggested:
- "A Layer of Swiss Cheese" - why not! This can make sense in DeepMind's plan, that was really good by the way.
- "Enhancing Black-Box Evaluations" - I think a better theory is interp to complement AI Control techniques. Example: Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals).
  - Maybe Simple probes can catch sleeper agents \ Anthropic could also be interesting, in the sense that the probe seems to generalize surprisingly well (I would really like to know if this generalizes to a model that was not trained to be harmful in the first place).
- "Debugging mysterious behaviour" - Might be interesting, might help marginally to get better understanding, but this is not very central for me.

[-]David Manheim2mo1-1

First, strongly agreed on the central point - I think that as a community, we've been too heavily investing in the tractable approaches (interpretability, testing, etc.) without having the broader alignment issues taking front stage. This has led to lots of bikeshedding, lots of capabilities work, and yes, some partial solutions to problems.

That said, I am concerned about what happens if interpretability is wildly successful - against your expectations. That is, I see interpretability as a concerning route to attempted alignment even if it succeeds in getting past the issues you note on "miss things," "measuring progress," and "scalability," partly for reasons you discuss under obfuscation and reliability. Wildly successful and scalable interpretability without solving other parts of alignment would very plausibly function as a very dangerously misaligned system, and the methods for detection themselves arguably exacerbate the problem. I outlined my potential concerns about this case in more detail in a post here. I would be very interested in your thoughts about this. (And thoughts from @Buck / @Adam Shai as well!)

[-]Sodium2mo00

Re: Black box methods like "asking the model if it has hidden goals."

I'm worried that these methods seem very powerful (e.g., Evans group's Tell me about yourself, the pre-fill black box methods in Auditing language models) because the text output of the model in those papers haven't undergone a lot of optimization pressure.

Outputs from real world models might undergo lots of scrutiny/optimization pressure^[1] so that the model appears to be a "friendly chatbot." AI companies put much more care into crafting those personas than model organisms researchers would, and thus the AI could learn to "say nice things" much better.

So it's possible that model internals will much more faithful relative to model outputs in real world settings compared to in academic settings.

^{^}
or maybe they'll just update GPT-4o to be a total sycophant and ship it to hundreds of millions people. Honestly hard to say nowadays.

[-]Buck Shlegeris2mo106

This is why AI control research usually assumes that none of the methods you described work, and relies on black-box properties that are more robust to this kind of optimization pressure (mostly "the AI can't do X").

[-]Neel Nanda2mo20

Agreed in principle, my goal in the section on black box stuff is to lay out ideas that I think could work in spite of this optimisation pressure

AI ALIGNMENT FORUM
AF

117