AI ALIGNMENT FORUM

Dakara

Comments

By Default, GPTs Think In Plain Sight
Dakara · 7mo

Now that two years have passed, I would be very interested to hear @Fabien Roger's thoughts on this comment, especially this part: "But how useful could gpt-n be if used in such a way? On the other extreme, gpt-n is producing internal reasoning text at a terabyte/minute. All you can do with it is grep for some suspicious words, or pass it to another AI model. You can't even store it for later unless you have a lot of hard drives. Potentially much more useful. And less safe."

Simple probes can catch sleeper agents
Dakara · 7mo

This paper argues that unintended deceptive behavior cannot be detected by probing. According to its authors, probes fare no better than random guessing at detecting such behavior.

I would really appreciate any input, especially from Monte or his co-authors. This seems like a very important issue to address.
