Thane Ruthenis

The way I'd phrase it[1] is that the set of all acausal deals made by every civilization with every other civilization potentially has an abstract hierarchical structure, same way everything else does. Meaning there are commonly reoccurring low-level patterns and robust emergent high-level dynamics, and you can figure those out (and start following them) without actually explicitly running full-fidelity simulations of all these other civilizations. Doing so would then in-expectation yield you a fair percentage of the information you'd get from running said full-fidelity simulations.

This is similar to e. g. how we can use the abstractions of "government", "culture", "society" and "economy" to predict the behavior of humans on Earth, without running full-fidelity simulations of each individual person, and how this lets us mostly correctly predict the rough shape of all of their behaviors.

I think it's on-its-face plausible that the acausal "society" is the same. There are some reasons to think there are convergently reoccurring dynamics (see the boundaries discussion), the space of acausal deals/Tegmark IV probably has a sort of "landscape"/high-level order to it, etc.

(Another frame: instead of running individual full-fidelity simulations of every individual civilization you're dealing with, you can run a coarse-grained/approximate simulation of the entirety of Tegmark IV, and then use just that to figure out roughly what sorts of deals you should be making.)

  1. ^

    Or maybe this is a completely different idea/misinterpretation of the post. I read it years ago and only skimmed it now, so I may be misremembering. Sorry if so.

Cool. I've had the same idea, that we want something like "synergistic information present in each random subset of the system's constituents", and yeah, it doesn't work out-of-the-box.

Some other issues there:

  • If we're actually sampling random individual atoms all around the dog's body, it seems to me that we'd need an incredibly large number of them to decode anything useful. Many more than if we were sampling random small connected chunks of atoms.
    • More intuitive example: Suppose we want to infer a book's topic. What's the smallest $n$ such that we can likely infer the topic from a random contiguous string of length $n$? Comparatively, what's the smallest $m$ such that we can infer it from $m$ letters randomly and independently sampled from the book's text? It seems to me that $n \ll m$.
  • But introducing "chunks of nearby variables" requires figuring out what "nearby" is, i. e., defining some topology for the low-level representation. How does that work?
  • Further, the size of the chunk needed depends a lot on which part of the system we sample, so just going "a flat % of all constituents" doesn't work. Consider happening to land on a DNA string vs. some random part of the interior of the dog's stomach.
    • Actually, dogs are kind of a bad example: animals do have DNA signatures spread all around them. A complex robot, then. If we have a diverse variety of robots, inferring the specific type is easy if we sample e. g. part of the hardware implementing its long-term memory, but not if we sample a random part of an appendage.
    • Or a random passage from the book vs. the titles of the book's chapters. Or even just "a sample of a particularly info-dense paragraph" vs. "a sample from an unrelated anecdote from the author's life". % of the total letter count just doesn't seem like the right notion of "smallness".
  • On the flip side, sometimes it's reversed: sometimes we do want to sample random unconnected atoms. E. g., the nanomachine example: if we happen to sample the "chunk" corresponding to appendage#12, we risk learning nothing about the high-level state, whereas if we sample three random atoms from different parts of it, that might determine the high-level state uniquely. So now the desired topology of the samples is different: we want non-connected chunks.

I'm currently thinking this is solved by abstraction hierarchies. Like, maybe the basic definition of an abstraction is of the "redundant synergistic variable" type, and the lowest-level abstractions are defined over the lowest-level elements (molecules over atoms). But then higher-level abstractions are redundant-synergistic over lower-level abstractions (rather than actual lowest-level elements), and up it goes. The definitions of the lower-level abstractions provide the topology + sizing + symmetries, which higher-level abstractions then hook up to. (Note that this forces us to actually step through the levels, either bottom-up or top-down.)

As examples:

  • The states of the nanomachine's modules are inferable from any subset of the modules' constituent atoms, and the state of the nanomachine itself is inferable from the states of any subset of the modules. But there's no such neat relationship between atoms and the high-level state.
  • "A carbon atom" is synergistic information about a chunk of voxels (baking-in how that chunk could vary, e. g. rotations, spatial translations); "a DNA molecule" is synergistic information about a bunch of atoms (likewise defining custom symmetries under which atom-compositions still count as a DNA molecule); "skin tissue" is synergistic over molecules; and somewhere up there we have "a dog" synergistic over custom-defined animal-parts.

Or something vaguely like that; this doesn't exactly work either. I'll have more to say about this once I finish distilling my notes for external consumption instead of expanding them, which is going to happen any... day... now...

I very tentatively agree with that.

I'd guess it's somewhat unlikely that large AI companies or governments would want to continue releasing models with open weights once they are this capable, though underestimating capabilities is possible

I think that's a real concern, though. I think the central route by which going open-source at the current capability level leads to extinction is a powerful AI model successfully sandbagging during internal evals (which seems pretty easy for an actually dangerous model to do, given evals' current state), getting open-sourced, and things then going the "rogue replication" route.

I'm possibly missing something basic here, but: how is the redund/latent-focused natural-abstraction theory supposed to deal with synergistic information (and "emergent" dynamics)?

Consider a dog at the level of atoms. It's not, actually, the case that "this is a dog" is redundantly encoded in each atom. Even if each atom were clearly labeled, and we had an explicit, approximately deterministic $(\text{atom states}) \mapsto \{\text{dog}, \text{cat}, \dots\}$ function, the state of any individual atom would constrain the output not at all. Atom#2354 being in state #7532 is consistent with its being part of either a dog, or a cat, or an elephant...

This only stops applying if we consider macroscopically sized chunks of atoms, or the specific set of microscopically sized chunks corresponding to DNA.

And even that doesn't always work. Consider a precision-engineered nanomachine, with each atom accounted for. Intuitively, "the nanomachine's state" should be an abstraction over those atoms. However, there's not necessarily any comparatively minuscule "chunk" of the nanomachine that actually redundantly encodes its state! E. g., a given exact position of appendage#12 may be consistent either with resource-extraction or with rapid travel.

So: Suppose we have some set of random variables $X = (X_1, \dots, X_n)$ representing some cube of voxels, where each voxel reports what atoms are in it. Imagine a dataset of various animals (or nanomachines) in this format, of various breeds and in various positions.

"This is a dog" tells us some information about . Indeed, it tells us a fairly rich amount of information: the general "shape" of what we should expect to see there. However, for any individual .[1] Which is to say: "this is a dog" is synergistic information about ! Not redundant information. And symmetrically, sampling a given small chunk of  won't necessarily tell us whether it's the snapshot of a dog or a cat (unless we happen to sample a DNA fragment). , but .

One way around this is to suggest that cats/dogs/nanomachines aren't abstractions over their constituent parts, but abstractions over the resampling of all their constituent parts under state transitions. I. e., suppose we now have 3D video recordings: then "this is a dog" is redundantly encoded in each time-slice $X^t$, for $t = 1, \dots, T$.

But that seems counterintuitive/underambitious. Intuitively, tons of abstractions are about robust synergistic information/emergent dynamics.

Is there some obvious way around all that, or is it currently an open question?

  1. ^

    Though it's not literally zero. E. g., if we have a fixed-size voxel cube, then depending on whether it's a dog or an elephant, we should expect the voxels at the edges to be more or less likely to contain air vs. flesh.

I don't see how these help.

First, it seems to me that interoperability + advisors would be useless for helping people sociopolitically maneuver right up until the point where the AI models are good enough to disempower most of them. Imagine some not-particularly-smart person with no AI expertise and not much mental slack for fighting some abstract battle about the future of humanity. A demographic of such people is then up against the legal teams of major corporations and the departments of major governments navigating the transition. The primary full-time job of the people working in the latter groups, at which they'd be very skilled, would be figuring out how to disempower the former demographic. In what scenario do the at-risk demographics stand any chance?

Well, if the AI models are good enough that the skills and attention of the humans deploying them don't matter. Which is to say: past the point at which the at-risk demographics are already disempowered.

Conversely, if the AI capabilities are not yet there, then different people can use AIs more or less effectively depending on how smart and skilled they are, and how much in the way of resources/spare attention they have for the fight. In which case the extant powers are massively advantaged and in-expectation win.

Second, I'm skeptical about feasibility. This requires, basically, some centralized distribution system which (1) faithfully trains AI models to be ultimately loyal to their end advisees, (2) fully subsidizes the compute costs for serving these models to all the economically useless people (... in the US? ... in the world?). How is that centralized system not subverted/seized by the extant powers? (E. g., the government insisting, in a way that sounds surface-level reasonable, on ultimate loyalty to its laws first, which it can then freely rewrite to have complete control.)

Like, suppose this system is set up some time before AI models are good enough to make human workers obsolete. As per the first point, the whole timespan prior to the AI models becoming that good would involve the entrenched powers successfully scheming towards precisely the gradual disempowerment we're trying to prevent. How do we expect this UBI-like setup[1] to still be in place, uncorrupted, by the point AI models become good enough, given that it is the largest threat to the ability of extant powers to maintain/increase their power? This would be the thing everyone in power is trying to dismantle. (And, again, unless the AIs are already good enough to disempower the risk demographics, the risk demographics plus their AI representatives would be massively worse at fighting this battle than their opponents plus their AI representatives.)

Third: Okay, let's suppose the system is somehow in place, everyone in the world has a loyal AI representative advising them and advocating for their interests. What are these representatives supposed to do? Like, imagine some below-average-intelligence menial-labor worker hopelessly outclassed by robots. What moves is their AI representative supposed to make to preserve their power and agency? They don't really have any resources to bargain with; that's the ground truth.

Will the AIs organize them to march to the military bases/datacenters/politicians' and CEOs' homes and physically seize the crucial resources/take the key players hostage, or what?


There was a thought experiment going around on Twitter a few months back, which went:

Suppose people have to press one of two buttons, Blue and Red. If more than 50% of people press Blue, nobody dies. If more than 50% press Red, everyone who pressed Blue dies. Which button do you press?

Red is guaranteed to make you safe, and in theory, if everyone pressed Red, everyone would be safe. But if we imagine something isomorphic to this happening in reality, we should note that getting 100% of people to make the correct choice is incredibly hard. Some would be distracted, or confused, or they'd slip and fall and smack the wrong button accidentally.

To me, all variations on "let's give everyone in the world an AGI advisor to prevent disempowerment!" read as "let's all play Red!". The details don't quite match (50% survival/non-disempowerment rate is not at all guaranteed), but the vibe and the failure modes are the same.

  1. ^

    Which is also supposed to be implemented during Trump's term?

Employees at Anthropic don't think the RSP is LARP/PR. My best guess is that Dario doesn't think the RSP is LARP/PR.

Yeah, I don't think this is necessarily in contradiction with my comment. Things can be effectively just LARP/PR without being consciously LARP/PR. (Indeed, this is likely the case in most instances of LARP-y behavior.)

Agreed on the rest.

I'd been pretty much assuming that AGI labs' "responsible scaling policies" are LARP/PR, and that if an RSP ever conflicts with their desire to release a model, either the RSP will be swiftly revised, or the testing suite for the model will be revised such that it doesn't trigger the measures the AGI lab doesn't want to trigger. I. e.: that RSPs are toothless and that their only purposes are to showcase how Responsible the lab is and to hype up how powerful a given model ended up being.

This seems to confirm that cynicism.

(The existence of the official page tracking the updates is a (smaller) update in the other direction, though. I don't see why they'd have it if they consciously intended to RSP-hack this way.)

The dates used in our regression are the dates models were publicly released, not the dates we benchmarked them

Fair, also see my un-update edit.

Have you considered removing GPT-2 and GPT-3 from your models, and seeing what happens? As I'd previously complained, I don't think they can be part of any underlying pattern (due to the distribution shift in the AI industry after ChatGPT/GPT-3.5). And indeed: removing them seems to produce a much cleaner trend with a ~130-day doubling.

Do you also dislike Moore's law?

That one seems to work more reliably, perhaps because it became the metric the industry aims for.

I agree that if you add one outlier data point and then trend extrapolate between just the last two data points, you'll be in trouble

My issue here is that there wasn't that much variance in the performance of all preceding models they benchmarked: from GPT-2 to Sonnet 3.7, they seem to fall almost perfectly on a straight line. Then, the very first advancement of the frontier after the trend-model is released is an outlier. That suggests an overfit model.

I do agree that it might just be a coincidental outlier and that we should wait and see whether the pattern recovers with subsequent model releases. But this is suspicious enough I feel compelled to make my prediction now.

Edit: I've played with the numbers a bit more, and on reflection, I'm inclined to partially unroll this update. o3 doesn't break the trendline as much as I'd thought, and in fact, it's basically on-trend if we remove the GPT-2 and GPT-3 data-points (which I consider particularly dubious).


Regarding METR's agency-horizon benchmark:

I still don't like anchoring stuff to calendar dates, and I think the o3/o4-mini datapoints perfectly show why.

It would be one thing if they did fit into the pattern. If, by some divine will controlling the course of our world's history, OpenAI's semi-arbitrary decision about when to allow METR's researchers to benchmark o3 just so happened to coincide with the 2x/7-month model. But it didn't: o3 massively overshot that model.[1]

Imagine a counterfactual in which METR's agency-horizon model existed back in December, and OpenAI invited them for safety testing/benchmarking then, four months sooner. How different would the inferred agency-horizon scaling laws have been, how much faster the extrapolated progress? Let's run it:

  • o1 was announced September 12th, o3 was announced December 19th, 98 days apart.
  • o1 scored at ~40 minutes, o3 at ~1.5 hours, a 2.25x'ing.
  • There are ~2.14 intervals of 98 days in 7 months.
  • Implied scaling factor: $2.25^{2.14} \approx 5.7$x each 7 months.

And I don't see any reasons to believe it was overdetermined that this counterfactual wouldn't have actualized. METR could have made the benchmark a few months earlier, OpenAI could have been more open about benchmarking o3.

And if we lived in that possible world... It's now been 135 days since December 19th, i. e., ~1.38 intervals of 98 days. Extrapolating, we should expect the best publicly known model to have a time horizon of $1.5 \times 2.25^{1.38} \approx 4.6$ hours. I don't think we have any hint that those exist.
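
For concreteness, here's the arithmetic behind both that figure and the implied scaling factor above, as a quick Python sketch (the horizon lengths are the rough figures quoted in this comment, not METR's exact measurements):

```python
# Rough arithmetic for the counterfactual above. The horizon figures are the
# approximate numbers quoted in this comment, not METR's exact data.
o1_horizon_min = 40                      # o1: ~40-minute horizon
o3_horizon_min = 90                      # o3: ~1.5-hour horizon
gap_days = 98                            # Sep 12 -> Dec 19 announcements
seven_months_days = 7 * 30               # ~210 days, i.e. ~2.14 gaps

growth_per_gap = o3_horizon_min / o1_horizon_min                # 2.25x per 98 days
implied_factor = growth_per_gap ** (seven_months_days / gap_days)
print(f"implied scaling: ~{implied_factor:.1f}x per 7 months")  # ~5.7x

days_since_o3 = 135                      # Dec 19 -> time of writing
expected_now_h = (o3_horizon_min / 60) * growth_per_gap ** (days_since_o3 / gap_days)
print(f"extrapolated current horizon: ~{expected_now_h:.1f} hours")  # ~4.6 hours
```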

So: in that neighbouring world in which OpenAI let METR benchmark o3 sooner, we're looking around and seeing that the progress is way behind the schedule.[2]

To me, this makes the whole model fall apart. I don't see how it can track any mechanistic, model-based picture of what's actually happening, and as per the o3/o4-mini data points, it doesn't predict the empirical reality well either. Further, whether we believe that the progress is much faster vs. much slower than expected is entirely controlled by the arbitrary fact that METR didn't get to benchmark o3 in December.

I think we're completely at sea.

  1. ^

    o3's datapoint implies a 4x/7-month model, no? Correct me if I'm wrong:

    • Sonnet 3.7 was released 24th of February, 2025; o3's System Card and METR's reports were released 16th of April, 2025: 51 days apart.
    • Sonnet 3.7 is benchmarked as having 1-hour agency; o3 has 1.5x that, ~1.5-hour agency.
    • 7 months contain 3.5 two-month intervals. This means that, if horizons extend as fast as they did between 3.7 and o3, we should expect a $1.5^{3.5} \approx 4$x'ing of agency horizons each 7 months.
  2. ^

    Edit: Yes, counterfactual!METR wouldn't have used just those two last data points, so the inferred multiplier would've been somewhat less than that. But I think it would've still been bigger than 2x/7-months, and the graph would've been offset to the left (the 1.5-hour performance achieved much earlier), so we'd still be overdue for ~2.5-hour AIs. Half-a-year behind, I think?
