This is basically the reason I regard this as the most technically challenging of the things we're presenting here. You eventually want a system which is not just a passive consumer of data, but can actively explore. You may need to give it access to robots with cameras and internet so that it can verify some of the basics of its setup. It might still fear that the entire thing is being spoofed, but I think it's vastly harder to generate a plausible world that's robust to the agent exploring and running consistency probes.
Plausible, yes. For one thing, you can run versions of the coordination tech in parallel with older, cheaper models, and flag and dig into discrepancies. This could make it harder for misalignment to strongly bite.
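To make the parallel-running idea a bit more concrete, here's a minimal sketch (all names and the agreement check are hypothetical, just for illustration) of routing the same requests through the new coordination system and an older, cheaper reference model, and collecting the cases where they diverge for humans to dig into:

```python
from typing import Callable, List

def flag_discrepancies(
    queries: List[str],
    new_system: Callable[[str], str],       # the coordination tech under evaluation
    reference_model: Callable[[str], str],  # an older, cheaper model run in parallel
    agree: Callable[[str, str], bool],      # domain-specific check that two answers match
) -> List[dict]:
    """Run both systems on each query and collect the cases where they disagree."""
    flagged = []
    for query in queries:
        new_answer = new_system(query)
        old_answer = reference_model(query)
        if not agree(new_answer, old_answer):
            flagged.append({
                "query": query,
                "new_answer": new_answer,
                "reference_answer": old_answer,
            })
    return flagged  # hand these to humans (or further tooling) to investigate
```

Of course the hard part in practice is the agreement check and deciding which discrepancies matter; this only shows the shape of the cross-checking loop.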
Of course if there are big misalignment issues and we're not seriously tracking that there could be big misalignment issues, that's gonna be a problem.
I feel like you're baking a lot into this clause:
With AI delegates, they would presumably be verifiable and would be programmed to tell the truth and keep to deals
I think that aiming for an equilibrium where that's true would be good, but I'm not certain that's the starting point (and if it were otherwise going to scupper getting this off the ground, it probably shouldn't be the starting point).
So if one person adopts the AI delegate and another doesn't, then the human can exaggerate their preferences, withhold information, and even defect on the deal (without blatantly lying), but a verifiable AI delegate presumably wouldn't be able to do that?
I see no reason why an AI delegate shouldn't be able to withhold information. I agree that people might want delegates that could do the other things too, but I think that it might be better for the human principal if it couldn't -- the delegate can develop a reputation as trustworthy (in a way that's hard for an individual human to match, because others don't see enough of a track record).
I agree that there are significant concerns here! FWIW I'm more concerned about the adversarially-manipulated layer (at least as something needing attention now). I think that a lot of these applications could work with systems that aren't much stronger than what we have today; but that getting effective misaligned scheming would require a significant step up in capabilities. (You might have weaker forms of misalignment, but I think that those are pretty similar to "the systems just aren't really good enough yet".)
We didn't, although two of us were involved in running the AI for Human Reasoning fellowship, and some of the fellows on that did.
I think the reasons we didn't go deeper on this are basically a mix of:
Could you comment on the sense of "should" you have in mind in this post?
I think your core thesis is something like "it would be more socially efficient for AI systems to have prosocial drives". (I lean agree.)
But then sometimes you write as though the implication is "AI companies should unilaterally implement more prosocial drives in their systems". And this feels much less obvious to me.
If the purchasers of AI services prefer them to not have prosocial drives, then this could be imposing values on the consumers (which might ultimately have the effect of driving people to other AI providers). You might think that it's worthwhile to have that kind of imposition for a period, and that socially-minded AI companies should do it -- but if that's the heart of the claim I really think you should be explicit about it (and that AI companies should justifiably be less willing to listen to you if you're not).
Another angle might be: if we ultimately want AI systems to have prosocial drives, we might think about how that should be incentivized -- whether by trying to shape consumer preference, by legal mandate, or by economic gradients (e.g. differing tax rates according to the degree of prosociality).
Anyway, it's possible I'm missing something here! Would love to hear how you're thinking about this question.
Thanks, I agree with your mathematics and think this framework is helpful for letting us zoom in to possible disagreements.
There are two places where I find myself sceptical of the framing in your comment:
Maybe there's a common theme here: I have the impression that I'm more imagining a default world where we get these upgrades to strategic capacity in a timely fashion, and then considering deviations from that; and you're more saying "well maybe things look like that, but maybe they look quite different", and less privileging the hypothesis.
I guess I do just think it's appropriate to privilege this hypothesis. We've written about how even current or near-term AI could serve to power tools which advance our strategic understanding. I think that this is a sufficiently obvious set of things to build, and there will be sufficient appetite to build them, that it's fair to think it will likely be getting in gear (in some form or another) before most radically transformative impacts hit. I wouldn't want to bet everything on this hypothesis, but I do think it's worth exploring what betting on it properly would look like, and then committing a chunk of our portfolio to that (if it's not actively bad on other perspectives).
You discuss the idea of clauses that allow for later escape from poorly-conceived deals as a guardrail. This feels like a powerful possibility which might add a significant amount of robustness.
But I'm wondering if the idea might be more broadly applicable than that. If we have the kind of machinery that allows us to add that kind of clause, maybe we could use it for the whole essence of the deal? Rather than specify up front what you wish to exchange, just specify the general principles of exchange -- and trust the smarter and wiser actors of the future to interpret it in a fair and benevolent manner.
In general, reading this article I find that I have some sympathy for the central claim that there could be useful deals to strike early (that it isn't possible to strike later); however, I find myself feeling quite sceptical of the frameworks for thinking about different types of deals etc. -- I don't see why we should think that we have done more here than scrape the surface of the universe of possibilities, and my best guess is that actually-wise deals would look quite different from anything you're outlining. Curious what you make of this -- does this feel too radically sceptical or something?
You can have a smart system make inferences from camera-visible information.
But yeah, the main use case we had in mind for the monitoring layer was not about these very tricky-to-observe states, but expanding the space of things you can make agreements about (potentially including some high-stakes cases, as I write about at the end of this story: https://strangecities.substack.com/p/some-days-soon).