Neel Nanda

Sequences

How I Think About My Research Process
GDM Mech Interp Progress Updates
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level
Interpreting Othello-GPT
200 Concrete Open Problems in Mechanistic Interpretability

Comments

I disagree re the way we currently use "understand" - e.g. I think that SAE reconstructions have the potential to smuggle in lots of things, e.g. via the exact values of the continuous activations, latents that don't quite mean what we think, etc.

It's plausible that a future, stricter definition of "understand" fixes this, though, in which case I might agree? But I would still be concerned that 99.9% understanding involves a really long tail of heuristics, and I don't know what may emerge from combining many things that individually make sense. And I probably put >0.1% on a superintelligence being able to adversarially smuggle things we don't like into a system we think we understand.

Anyway, all that pedantry aside, my actual concern is tractability. If addressed, this seems plausibly helpful!

Agreed in principle - my goal in the section on black-box stuff is to lay out ideas that I think could work in spite of this optimisation pressure.

I think the correct question is how much of an update you should make in an absolute sense, rather than a relative sense. Many people in this community are overconfident, and if you decide that every person is less worth listening to than you thought, this doesn't change who you listen to, but it should make you a lot more uncertain in your beliefs.

Are the joint names separated by spaces? If not, the tokenization is going to be totally broken. More generally, I would be interested to see this tried with a code that e.g. maps familiar tokens to obscure ones, or something like mapping the token with id k to id (max − k). Tokens feel like the natural way an LLM would represent its processing, and thus encode its processing. Doing things with individual letters is kind of hard.
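(For concreteness, here is a minimal sketch of the kind of token-id remapping code described above, assuming a HuggingFace GPT-2 tokenizer purely for illustration; the tokenizer choice and all names are assumptions, not anything from the original setup.)

```python
# Hypothetical sketch of the suggested token-id remapping cipher: map token
# id k to (vocab_size - 1 - k). The gpt2 tokenizer is an arbitrary,
# illustrative choice.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
V = tok.vocab_size

def flip_ids(ids):
    """Map each token id k to V - 1 - k; the map is its own inverse."""
    return [V - 1 - k for k in ids]

ids = tok.encode("The quick brown fox")
coded_text = tok.decode(flip_ids(ids))   # obscure-looking "encoded" text
assert flip_ids(flip_ids(ids)) == ids    # exact round-trip at the id level
# Note: re-encoding coded_text is not guaranteed to recover the flipped ids,
# since decode/encode are not exact inverses; operating on id lists is exact.
```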

Thanks a lot for doing this work. This is substantially more evaluation awareness than I would have predicted. I'm not super convinced by the classifying-transcript-purpose experiments, since the evaluator model is plausibly primed to think about this stuff, but the monitoring results seem compelling and very concerning. I guess we should really get concerned when it stops showing up in the chain of thought...

I would disagree that either one or four was achieved because, to my knowledge, the auditing game focused on finding, not fixing. It's also a bit ambiguous whether the prediction involves fixing with interpretability or fixing with unrelated means. It wouldn't surprise me if you could use the SAE-derived insights to filter out the relevant data if you really wanted to, but I'd guess an LLM classifier is more effective. Did they do that in the paper?

(To be clear, I think it's a great paper, and to the degree that there's a disagreement here it's that I think your predictions weren't covering the right comparative advantages of interp)


This is fantastic work, I'd be very excited to see more work in the vein of auditing games. It seems like one of the best ways so far to test how useful different techniques for understanding models are.

In my incredibly biased opinion, the GDM AGI safety team is great and an effective place to work on reducing AI x-risk, and I would love to get applications from people here.

For posterity, this turned out to be a very popular technique for jailbreaking open-source LLMs - see this list of the 2000+ "abliterated" models on HuggingFace (abliteration is a mild variant of our technique that someone coined shortly after; I think the main difference is that you do a bit of DPO after ablating the refusal direction to fix any issues introduced?). I don't actually know why people prefer abliteration to just finetuning, but empirically people use it, which is good enough for me to call it beating baselines on some metric.
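(As a concrete reference for the technique being discussed, here is a minimal sketch of ablating a refusal direction from activations; the function and variable names are illustrative assumptions, not the exact code from the paper, and for released "abliterated" checkpoints the edit is presumably baked into the weights rather than applied as a runtime hook.)

```python
# Illustrative sketch: remove the component of each activation vector that
# lies along a given "refusal direction" d.
import torch

def ablate_direction(acts: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """acts: (..., d_model); d: (d_model,). Returns acts with the d-component projected out."""
    d_hat = d / d.norm()
    return acts - (acts @ d_hat).unsqueeze(-1) * d_hat

# The same projection can also be applied to the weight matrices that write to
# the residual stream, which is how the edit gets baked into a released checkpoint.
```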

This looks extremely comprehensive and useful, thanks a lot for writing it! Some of my favourite tips (like clipboard managers and Rectangle) were included, which is always a good sign. And I strongly agree with "Cursor/LLM-assisted coding is basically mandatory".

I passed this on to my mentees. Not all of it transfers to mech interp - in particular, the time between experiments is often much shorter (e.g. a few minutes, or even seconds), and often almost an entire project is in de-risking mode - but much of it transfers. And the ability to get shit done fast is super important.
