AI ALIGNMENT FORUM
Dakara

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by

Newest

By Default, GPTs Think In Plain Sight

Dakara 7mo 1 0

Now that two years have passed, I would be quite interested to hear @Fabien Roger's thoughts on this comment, especially this part: "But how useful could gpt-n be if used in such a way? On the other extreme, gpt-n is producing internal reasoning text at a terabyte/minute. All you can do with it is grep for some suspicious words, or pass it to another AI model. You can't even store it for later unless you have a lot of hard drives. Potentially much more useful. And less safe."

Simple probes can catch sleeper agents

Dakara 7mo 0 0

This paper argues that unintended deceptive behavior cannot be detected by probing. According to its authors, probes fare no better than random guessing at detecting such behavior.

I would really appreciate any input, especially from Monte or his co-authors. This seems like a very important issue to address.
