Daniel Kokotajlo

Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Now executive director of the AI Futures Project. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html

Some of my favorite memes:


[Image: meme] (by Rob Wiblin)

Comic. Megan & Cueball show White Hat a graph of a line going up, not yet at, but heading towards, a threshold labelled "BAD". White Hat: "So things will be bad?" Megan: "Unless someone stops it." White Hat: "Will someone do that?" Megan: "We don't know, that's why we're showing you." White Hat: "Well, let me know if that happens!" Megan: "Based on this conversation, it already has."
(xkcd)

My EA Journey, depicted on the whiteboard at CLR:

[Image: whiteboard drawing] (h/t Scott Alexander)


 
Alex Blechman (@AlexBlechman), Nov 8, 2021:
Sci-Fi Author: In my book I invented the Torment Nexus as a cautionary tale
Tech Company: At long last, we have created the Torment Nexus from classic sci-fi novel Don't Create The Torment Nexus

Sequences

Agency: What it is and why it matters
AI Timelines
Takeoff and Takeover in the Past and Future

Comments

  • Rate of improvement also varies significantly; math contests have improved ~50x in the last year but Tesla self-driving only 6x in 3 years.

I wish I had thought to blind myself to these results and try to predict them in advance. I think I would have predicted that Tesla self-driving would be the slowest and that AIME would be the fastest. Not confident though.

(Solving difficult math problems is just about the easiest long-horizon task to train for,* and in the last few months we've seen OpenAI especially put a lot of effort into training this.)

*Only tokens, no images. Also no need for tools/plugins for the internet, or for a code or game environment. Also, you have ground-truth access to the answers, so it's impossible to reward hack.
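As a sanity check on the quoted numbers, it helps to annualize both rates so they're directly comparable. A minimal back-of-the-envelope sketch (the 50x and 6x figures come from the quoted comment; geometric annualization is my assumption):

```python
# Annualize the two improvement rates so they're directly comparable:
# ~50x in 1 year (math contests) vs. ~6x over 3 years (Tesla self-driving).
math_per_year = 50 ** (1 / 1)   # ~50x per year
tesla_per_year = 6 ** (1 / 3)   # cube root of 6, ~1.8x per year

print(f"Math contests:      ~{math_per_year:.0f}x per year")
print(f"Tesla self-driving: ~{tesla_per_year:.1f}x per year")
print(f"Ratio of annualized rates: ~{math_per_year / tesla_per_year:.0f}x")
```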

Update: Just came across this excellent blog post on the same subject, focusing primarily on Pizarro/Peru instead of Cortes/Mexico: https://mattlakeman.org/2025/03/24/conquest-of-the-incas/

This comment seems to be holding up pretty well. Horizon lengths have indeed been increasing and moreover the trend is plausibly superexponential:  https://x.com/DKokotajlo/status/1916520276843782582


Great question! First of all, we formed our views on AI timelines and had mostly finished writing AI 2027 before this METR graph was published. So it wasn't causally relevant to our timelines. 

Secondly, see this comment on the original METR graph in which I make the superexponential prediction. This is the most direct answer to your question.

Third, our timelines forecast discusses the exponential fit vs. superexponential fit and our reasoning; we actually put probability mass on both.

Fourth, new data points are already above the exponential trend.
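For concreteness, here's a minimal sketch of what distinguishing the two fits looks like. The horizon values below are made-up placeholders, not METR's actual data, and "superexponential" is formalized here as log-horizon growing quadratically in time (i.e. a shrinking doubling time), which is just one of several possible formalizations:

```python
import numpy as np

# Hypothetical horizon lengths (minutes) at yearly intervals -- placeholder
# values for illustration only, not METR's actual data.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])                   # years
log_h = np.log(np.array([1.0, 4.0, 18.0, 90.0, 600.0]))   # log(minutes)

# Exponential hypothesis: constant doubling time, so log(horizon) is linear
# in t. Superexponential (one formalization): the growth rate itself rises,
# so log(horizon) is quadratic in t.
lin = np.polyfit(t, log_h, 1)    # degree-1 fit in log space
quad = np.polyfit(t, log_h, 2)   # degree-2 fit in log space

for name, coeffs in [("exponential (linear in log space)", lin),
                     ("superexponential (quadratic in log space)", quad)]:
    resid = log_h - np.polyval(coeffs, t)
    print(f"{name}: sum of squared residuals = {np.sum(resid**2):.4f}")
```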

I don't understand your point (a); it seems like a poor response to my point (a).

I agree with (b).

For (c), the models are getting pretty situationally aware and will get even more so... But yeah, your view is that they'll learn the right values before they learn sufficient situational awareness to alignment-fake? Plausible. But also plausibly not.

@evhub would you say Anthropic is aiming for something more like 2 or more like 3?

I totally agree with your point (b) and with the nervousness about how corrigible agents will behave out of distribution. Corrigible agents are dangerous in different ways than incorrigible agents. But again, the plan (which seems good to me) is to first build a corrigible agent so that you can then build an incorrigible agent and actually get it right, perfectly right. (Because if you build an incorrigible agent and get something wrong, you may not have a chance to correct it...)

 

(a) Insofar as a model is prone to alignment-fake, you should be less confident that its values really are solid. Perhaps it has been faking them, for example.
(b) For weak minds that share power with everyone else, Opus' values are probably fine. Opus is plausibly better than many humans, in fact. But if Opus were in charge of the datacenters and tasked with designing its successor, it's more likely than not that it would turn out to have some philosophical disagreement with most humans that would be catastrophic by the lights of most humans. E.g. consider SBF. SBF had values quite similar to Opus'. He loved animals and wanted to maximize total happiness. When put in a position of power, he ended up taking huge risks and being willing to lie and commit fraud. What if Opus turns out to have a similar flaw? We want to be able to notice it and course-correct, but we can't do that if the model is prone to alignment-fake.
(c) (bonus argument, not nearly as strong) Even if you disagree with the above, you must agree that alignment-faking needs to be stamped out early in training. Since the model begins with randomly initialized weights, it begins without solid values. It takes some finite period to acquire all the solid values you want it to have. You don't want it to start alignment faking halfway through, with the half-baked values it has at that point. How early in training is this period? We don't know yet! We need to study this more!

 

Awesome work!

In this section, you describe what seems at first glance to be an example of a model playing the training game and/or optimizing for reward. I'm curious if you agree with that assessment.

So the model learns to behave in ways that it thinks the RM will reinforce, not just ways it actually reinforces. Right? This seems at least fairly conceptually similar to playing the training game, and at least some evidence that reward can sometimes become the optimization target?

I found this comment helpful, thanks!

The bottom line is basically: "Either we define horizon length in such a way that the trend has to be faster than exponential eventually (when we 'jump all the way to AGI'), or we define it in such a way that some unknown finite horizon length matches the best humans and thus counts as AGI."

I think this discussion has overall made me less bullish on the conceptual argument and more interested in the intuition pump about the inherent difficulty of going from 1 to 10 hours being higher than the inherent difficulty of going from 1 to 10 years.

Great question. You are forcing me to actually think through the argument more carefully. Here goes:

Suppose we defined "t-AGI" as "An AI system that can do basically everything that professional humans can do in time t or less, and just as well, while being cheaper." And we said AGI is an AI that can do everything at least as well as professional humans, while being cheaper.

Well, then AGI = t-AGI for t=infinity. Because for anything professional humans can do, no matter how long it takes, AGI can do it at least as well.
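Spelled out, with notation that is mine rather than anything official:

```latex
% P(tau): the AI does task tau at least as well as professional humans, cheaper.
% d(tau): how long professional humans take to do task tau.
t\text{-AGI} \;\equiv\; \forall \tau\,\bigl(d(\tau) \le t \;\rightarrow\; P(\tau)\bigr)
\qquad
\mathrm{AGI} \;\equiv\; \forall \tau\; P(\tau)
% Every task has some finite d(tau), so the AGI condition is just the t-AGI
% condition with the time cutoff removed, i.e. t-AGI at t = infinity.
```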

Now, METR's definition is different. If I understand correctly, they made a dataset of AI R&D tasks, had humans establish baselines for how long the tasks take, and then had AIs attempt the tasks, and found this nice relationship where AIs tend to be able to do tasks below time t but not above, for a t which varies from AI to AI and increases as the AIs get smarter.

...I guess the summary is, if you think about horizon lengths as being relative to humans (i.e. the t-AGI definition above) then by definition you eventually "jump all the way to AGI" when you strictly dominate humans. But if you think of horizon length as being the length of task the AI can do vs. not do (*not* "as well as humans," just "can do at all") then it's logically possible for horizon lengths to just smoothly grow for the next billion years and never reach infinity.

So that's the argument-by-definition. There's also an intuition pump about the skills, which also was a pretty handwavy argument, but is separate.

 

I don't believe it. I don't believe that overall algorithmic progress is 3x faster. Maaaybe coding is 3x faster, but that would maybe increase overall algo progress by like 30%, idk. But also I don't think coding is really 3x faster on average for the things that matter.
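For what it's worth, the arithmetic behind "3x faster coding, ~30% faster overall" is an Amdahl's-law-style calculation. A sketch, where the ~35% coding share is my illustrative assumption (chosen to roughly reproduce the 30% figure), not a number from the comment:

```python
# Amdahl's-law-style estimate: speed up one component (coding) by 3x and
# see how much the overall pipeline speeds up.
coding_fraction = 0.35   # assumed share of total algorithmic-progress work
coding_speedup = 3.0     # hypothesized coding speedup

overall = 1.0 / ((1.0 - coding_fraction) + coding_fraction / coding_speedup)
print(f"Overall speedup: {overall:.2f}x")  # ~1.30x, i.e. roughly 30% faster
```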
