Greg_Colbourn ⏸️

5887 karma
Interests:
Slowing down AI

Bio


Global moratorium on AGI, now (Twitter). Founder of CEEALAR (née the EA Hotel; ceealar.org)

Comments (1175)

Some choice quotes:

The first concern is that Anthropic as an institution is net negative for the world (one can imagine various reasons for thinking this, but a key one is that frontier AI companies, by default, are net negative for the world due to e.g. increasing race dynamics, accelerating timelines, and eventually developing/deploying AIs that risk destroying humanity – and Anthropic is no exception), and that one shouldn’t work at organizations like that.

...

Another argument against working for Anthropic (or for any other AI lab) comes from approaches to AI safety that focus centrally/exclusively on what I’ve called “capability restraint” – that is, finding ways to restrain (and in the limit, indefinitely halt) frontier AI development, especially in a coordinated, global, and enforceable manner. And the best way to work on capability restraint, the thought goes, is from a position outside of frontier AI companies, rather than within them (this could be for a variety of reasons, but a key one would be: insofar as capability restraint is centrally about restraining the behavior of frontier AI companies, those companies will have strong incentives to resist it).

...

Another argument against AI-safety-focused people working at Anthropic is that it’s already sucking up too much of the AI safety community’s talent. This concern can take various forms (e.g., group-think and intellectual homogeneity, messing with people’s willingness to speak out against Anthropic in particular, feeding bad status dynamics, concentrating talent that would be marginally more useful if more widely distributed, general over-exposure to a particular point of failure, etc). I do think that this is a real concern – and it’s a reason, I think, for safety-focused talent to think hard about the marginal usefulness of working at Anthropic in particular, relative to non-profits, governments, other AI companies, and so on. [It’s also one of the arguments for thinking that Anthropic might be net negative, and a reason that thought experiments like “imagine the current landscape without Anthropic” might mislead.]

...

Another concern about AI-safety-focused people working at AI companies is that it will restrict/distort their ability to accurately convey their views to the public – a concern that applies with more force to people like myself who are otherwise in the habit of speaking/writing publicly.

...

A different concern about working at AI companies is that it will actually distort your views directly – for example, because the company itself will be a very specific, maybe-echo-chamber-y epistemic environment, and people in general are quite epistemically permeable.

...

And of course, there are also concerns about direct financial incentives distorting one’s views/behavior – for example, ending up reliant on a particular sort of salary, or holding equity that makes you less inclined to push in directions that could harm an AI company’s commercial success.

...

A final concern about AI safety people working for AI companies is that their doing so will signal an inaccurate degree of endorsement of the company’s behavior, thereby promoting wrongful amounts of trust in the company and its commitment to safety.

...

relative to some kind of median Anthropic view, both amongst the leadership and the overall staff, I am substantially more worried about classic existential risk from misalignment

...

  • [14] There is at least some evidence that early investors in Anthropic got the impression that Anthropic was initially committed to not pushing the frontier – a commitment that would be at odds with their current policy and behavior (though: I think Anthropic has in fact taken costly steps in the past to not push the frontier – see e.g. discussion in this article). If Anthropic made and then broke commitments in this respect, I do think this is bad and a point against expecting them to keep safety-relevant commitments in the future. And it’s true, regardless, that some of Anthropic’s public statements suggested reticence about pushing the frontier (see e.g. quotes here), and it seems plausible that the company’s credibility amongst safety-focused people and investors benefited from cultivating this impression. That said, the fact that Anthropic in fact took costly steps not to push the frontier suggests that this reticence was genuine – albeit, defeasible. And I think benefiting from stated and genuine reticence that ended up defeated is different from breaking a promise.
  • People have expressed concerns about Anthropic quietly revising/weakening the commitments in its Responsible Scaling Policy (see e.g. here on failing to define “warning sign evaluations” by the time they trained ASL-3 models, and here on weakening ASL-3 weight-theft security requirements so that they don’t cover employees with weight-access). I haven’t looked into this in detail, and I think it’s plausible that Anthropic’s choices here were reasonable, but I do think that the possibility of AI companies revising RSP-like policies, even in a manner that abides by the amendment procedure laid out in those policies (e.g., getting relevant forms of board/LTBT approval), highlights the limitations of relying on these sorts of voluntary policies to ensure safe behavior, especially as the stakes of competition increase.
  • I think it was bad that Anthropic used to have secret non-disparagement agreements (though: these have been discontinued and previous agreements are no longer being enforced). It also looks to me like Sam McCandlish’s comment on behalf of Anthropic here suggested a misleading picture in this respect, though he has since clarified.
  • I’ve heard concerns that Anthropic’s epistemic culture involves various vices – e.g. groupthink, over-confidence about how much the organization is likely to prioritize safety when it deviates importantly from standard commercial incentives, over-confidence about the degree of safety the organization’s RSP is likely to ultimately afford, general miscalibration about the extent to which Anthropic is especially ethically-driven vs. more of a standard company – and that the leadership plays an important role in causing this. This one feels hard for me to assess from the outside (and if true, some of the vices at stake are hardly unique to Anthropic in particular). I’m planning to see what I think once I actually see the culture up close.
  • I also think it’s true, in general, that Anthropic’s researchers have played a meaningful role in accelerating capabilities in the past – e.g. Dario’s work on early GPTs.

...

I think Anthropic itself has a serious chance of causing or playing an important role in the extinction or full-scale disempowerment of humanity – and for all the good intentions of Anthropic’s leadership and employees, I think everyone who chooses to work there should face this fact directly.

...

I do not think that Anthropic or any other actor has an adequate plan for building superintelligence in a manner that brings the risk of catastrophic, civilization-ending misalignment to a level that a prudent and coordinated civilization would accept.

...

I think this plan is quite a bit more promising than some of its prominent critics do. But it is nowhere near good enough, and thinking it through in such detail has increased my pessimism about the situation. Why? Well, in brief: the plan is to either get lucky, or to get the AIs to solve the problem for us. Lucky, here, means that it turns out that we don’t need to rapidly make significant advances in our scientific understanding in order to learn how to adequately align and control superintelligent agents that would otherwise be in a position to disempower humanity – luck that, for various reasons, I really don’t think we can count on. And absent such luck, as far as I can tell, our best hope is to try to use less-than-superintelligent AIs – with which we will have relatively little experience, whose labor and behavior might have all sorts of faults and problems, whose output we will increasingly struggle to evaluate directly, and which might themselves be actively working to undermine our understanding and control – to rapidly make huge amounts of scientific progress in a novel domain that does not allow for empirical iteration on safety-critical failures, all in the midst of unprecedented commercial and geopolitical pressures. True, some combination of “getting lucky” and “getting AI help” might be enough for us to make it through. But we should be trying extremely hard not to bet the lives of every human and the entire future of our civilization on this. And as far as I can tell, any actor on track to build superintelligence, Anthropic included, is currently on track to make either this kind of bet, or something worse.

...

I do not believe that the object-level benefits of advanced AI[18] – serious though they may be – currently justify the level of existential risk at stake in any actor, Anthropic included, developing superintelligence given our current understanding of how to do so safely.

...

I think that in a wiser, more prudent, and more coordinated world, no company currently aiming to develop superintelligence – Anthropic included – would be allowed to do so given the state of current knowledge.

...

I think it’s possible that there will, in fact, come a time when Anthropic should basically just unilaterally drop out of the race – pivoting, for example, entirely to a focus on advocacy and/or doing alignment research that it then makes publicly available. And I wish I were more confident that in circumstances where this is the right choice, Anthropic will do it despite all the commercial and institutional momentum to the contrary.

...

if, as a result, I end up concluding that working at Anthropic is a mistake, I aspire to simply admit that I messed up, and to leave.

...

When I think ahead to the kind of work that this role involves, especially in the context of increasingly dangerous and superhuman AI agents, I have a feeling like: this is not something that we are ready to do. This is not a game humanity is ready to play. A lot of this concern comes from intersections with the sorts of misalignment issues I discussed above. But the AI moral patienthood piece looms large for me as well, as do the broader ethical and political questions at stake in our choices about what sorts of powerful AI agents to bring into this world, and about who has what sort of say in those decisions.

Joe goes on to provide counters to these, but imo those counters are much weaker than the initial considerations against Anthropic. It's like he's tying himself in knots to justify taking the job, when he already knows, deep down, that it's unconscionable.

This is incredible: it reads as a full justification for not working at Anthropic, yet the author concludes the opposite!

That's good to see, but the money, power, and influence are critical here[1], and those seem to be far too corrupted by investments in Anthropic, or just plain wishful techno-utopian thinking.

  1. ^

    The poll respondents are not representative of where EA's money, power, and influence lie. There is no one representing OpenPhil, CEA or 80k, no large donors, and only one top-25-karma account.

  • There is widespread discontent at the current trajectory of advanced AI development, with only 5% in support of the status quo of fast, unregulated development;
  • Almost two-thirds (64%) feel that superhuman AI should not be developed until it is proven safe and controllable, or should never be developed;
  • There is overwhelming support (73%) for robust regulation on AI. The fraction opposed to strong regulation is only 12%.

[Source]. I imagine global public opinion is similar. What we need to do now is mobilise a critical mass of that majority. If you agree, please share the global petition far and wide (use this version for people who want their name to be public).

(I think if we’d gotten to human-level algorithmic efficiency at the Dartmouth conference, that would have been good, as compute build-out is intrinsically slower and more controllable than software progress (until we get nanotech). And if we’d scaled up compute + AI to 10% of the global economy decades ago, and maintained it at that level, that also would have been good, as then the frontier pace would be at the rate of compute-constrained algorithmic progress, rather than the rate we’re getting at the moment from both algorithmic progress AND compute scale-up.)

This is an interesting thought experiment. I think it probably would've been bad, because it would've initiated an intelligence explosion. Sure, it would've started off very slow, but it would've gathered steam inexorably, speeding up tech development, including compute scaling. And all this before anyone had even considered the alignment problem. After a couple of decades, perhaps, humanity would already have been gradually disempowered past the point of no return.

the better strategy of focusing on the easier wins

I feel that you are not really appreciating the point that such "easier wins" aren't in fact wins at all, in terms of keeping us all alive. They might make some people feel better, but they are very unlikely to reduce AI takeover risk to, say, a comfortable 0.1% (in fact, I don't think they will reduce it below 50%).

I think I’m particularly triggered by all this because of a conversation I had last year with someone who takes AI takeover risk very seriously and could double AI safety philanthropy if they wanted to. I was arguing they should start funding AI safety, but the conversation was a total misfire because they conflated “AI safety” with “stop AI development”: their view was that that will never happen, and they were actively annoyed that they were hearing what they considered to be such a dumb idea. My guess was that EY’s TIME article was a big factor there.

Well, hearing this, I am triggered that someone "who takes AI takeover risk very seriously" would think that stopping AI development was "such a dumb idea"! I'd question whether they actually take AI takeover risk seriously at all. Whether or not a Pause is "realistic" or "will never happen", we have to try! It really is our only shot if we actually care about staying alive for more than another few years. More people need to realise this. And I still don't understand how people can think that the default outcome of AGI/ASI is survival for humanity, or an OK outcome.

...the question is how one can be so confident that any work we do now (including with ~AGI assistance, including if we’ve bought extra time via control measures and/or deals with misaligned ~AGIs) is insufficient, such that the only thing that makes a meaningful difference to x-risk, even in expectation, is a global moratorium. And I’m still not seeing the case for that. 

I'd flip this completely, and say: the question is why we should be so confident that any work we do now (including with AI assistance, including if we've bought extra time via control measures and/or deals with misaligned AIs) is sufficient to solve alignment, such that a global moratorium, the only thing that makes a meaningful difference to x-risk even in expectation, is unnecessary. I'm still not seeing the case for that.

It just looks a lot like motivated reasoning to me - kind of like they started with the conclusion and worked backward. Those examples are pretty unreasonable as conditional probabilities. Do they explain why "algorithms for transformative AGI" are very unlikely to meaningfully speed up software and hardware R&D?

Saying they are conditional does not mean they are. For example, why is P(We invent a way for AGIs to learn faster than humans | We invent algorithms for transformative AGI) only 40%? Or P(AGI inference costs drop below $25/hr (per human equivalent)[1] | We invent algorithms for transformative AGI) only 16%? These would be much more reasonable as unconditional probabilities. At the very least, "algorithms for transformative AGI" would be used to massively accelerate software and hardware R&D, even if expensive at first, such that inference costs would quickly drop.

  1. ^

    As an aside, surely this milestone has basically now already been reached? At least for the 90th-percentile human on most intellectual tasks.

If they were already aware, they certainly didn't do anything to address it, given their conclusion is basically a result of falling for it.

It's more than just intuitions; it's grounded in current research and recent progress in (proto-)AGI. Validating the opposing intuitions (long timelines) requires more in the way of leaps of faith (to say that things will suddenly stop working as they have been). Longer-timelines intuitions have also been proven wrong consistently over the last few years (e.g. AI constantly doing things people predicted were "decades away" just a few years, or even months, before).

I found this paper, which attempts a similar sort of exercise to the AI 2027 report and gets a very different result.

This is an example of the multiple stages fallacy (as pointed out here), where you can get an arbitrarily low probability for anything by dividing it into enough stages and assuming the stages are uncorrelated.
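To make the arithmetic concrete, here is a minimal sketch with made-up numbers (my own illustration, not the paper's actual estimates), showing how the independence assumption alone drives the joint probability towards zero, whereas strongly correlated stages do not:

```python
import math

# Ten stages, each given a seemingly reasonable 60% probability.
# These numbers are illustrative only, not taken from the paper.
stage_probs = [0.6] * 10

# If the stages are treated as independent, the joint probability is
# the product of all ten, which is already tiny.
p_independent = math.prod(stage_probs)  # 0.6 ** 10 ≈ 0.006

# If the stages are (close to) perfectly correlated (e.g. all driven by
# the same underlying breakthrough), the joint probability is roughly
# the probability of the weakest single stage.
p_correlated = min(stage_probs)  # ≈ 0.6

print(f"Joint probability assuming independence:       {p_independent:.3f}")
print(f"Joint probability assuming strong correlation: {p_correlated:.3f}")
```

Adding more stages under the independence assumption pushes the product as low as you like, which is the mechanism the fallacy exploits.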
