
By Robert Wiblin


Episode summary

Whatever your skills are, whatever your interests are, we’re out of the world where you have to be a conceptual self-starter, theorist mathematician, or a policy person — we’re into the world where whatever your skills are, there is probably a way to use them in a way that is helping make maybe humanity’s most important event ever go better.

— Holden Karnofsky

For years, working on AI safety usually meant theorising about the ‘alignment problem’ or trying to convince other people to give a damn. Even when you could find a way to help, the work was frustrating and offered little feedback.

According to Anthropic’s Holden Karnofsky, this situation has now reversed completely.

There are now many useful, concrete, shovel-ready projects with clear goals and deliverables. Holden thinks people haven’t appreciated the scale of the shift, and wants everyone to see the large range of ‘well-scoped object-level work’ they could personally help with, in both technical and non-technical areas.

In today’s interview, Holden — previously cofounder and CEO of Open Philanthropy — lists 39 projects he’s excited to see happening, including:

  • Training deceptive AI models to study deception and how to detect it
  • Developing classifiers to block jailbreaking
  • Implementing security measures to stop ‘backdoors’ or ‘secret loyalties’ from being added to models in training
  • Developing policies on model welfare, AI-human relationships, and what instructions to give models
  • Training AIs to work as alignment researchers

And that’s all just stuff he’s happened to observe directly, which is probably only a small fraction of the options available.

All this low-hanging fruit is one factor behind his decision to join Anthropic this year. That said, his wife was also a cofounder and president of the company, giving him a big financial stake in its success — and making it impossible for him to be seen as independent no matter where he worked.

Holden makes a case that, for many people, working at an AI company like Anthropic will be the best way to steer AGI in a positive direction. He notes there are “ways that you can reduce AI risk that you can only do if you’re a competitive frontier AI company.” At the same time, he believes external groups have their own advantages and can be equally impactful.

Outside critics worry that Anthropic’s efforts to stay at that frontier encourage competitive racing towards AGI — significantly or entirely offsetting any useful research they do. Holden thinks this seriously misunderstands the strategic situation we’re in.

“I work at an AI company, and a lot of people think that’s just inherently unethical,” he says. “They’re imagining [that] everyone wishes they could go slowly, but they’re going fast so they can beat everyone else. […] But I emphatically think this is not what’s going on in AI.”

The reality, in Holden’s view:

I think there’s too many players in AI who […] don’t want to slow down. They don’t believe in the risks. Maybe they don’t even care about the risks. […] If Anthropic were to say, “We’re out, we’re going to slow down,” they would say, “This is awesome! Now we have a better chance of winning, and this is even good for our recruiting” — because they have a better chance of getting people who want to be on the frontier and want to win.

Holden believes a frontier AI company can reduce risk by:

  • Developing cheap, practical safety measures other companies might adopt
  • Prototyping policies regulators could mandate
  • Gathering crucial data about what advanced AI can actually do

Host Rob Wiblin and Holden discuss the case for and against those strategies, and much more, in today’s episode.

This episode was recorded on July 25 and 28, 2025.

Video editing: Simon Monsour, Luke Monsour, Dominic Armstrong, and Milo McGuire
Audio engineering: Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: CORBIT
Coordination, transcriptions, and web: Katy Moore

The interview in a nutshell

Holden Karnofsky, who cofounded GiveWell and Open Philanthropy before joining Anthropic, believes humanity is handling the arrival of AGI about as badly as possible. Despite this, he puts roughly 50/50 odds on things going well even if we remain reckless — though he emphasises we’re taking far more risk than we should.

1. The AI race is NOT a coordination problem — too many actors genuinely want to race

Holden strongly rejects the common view that AI development is a prisoner’s dilemma where everyone wishes they could slow down but feels forced to race:

  • Many players don’t want to slow down at all: Some actors don’t believe in the risks, and others may just be fine with the risks.
  • If safety-minded people withdrew, little would change: Even if everyone concerned about AI risk left the field entirely, progress would slow only slightly — the remaining actors would eagerly fill the gap.
  • This isn’t about defecting in a prisoner’s dilemma: “If you’re in the wilderness and there’s a bunch of tigers trying to kill you, and then you try to kill the tigers, you didn’t just defect in a prisoner’s dilemma.”

2. Anthropic demonstrates it’s possible to be both competitive and safety focused

Holden argues Anthropic is successfully walking the “thin line” of racing to win while prioritising safety.

He discusses three models of impact unique to frontier companies:

  1. Develop and export cheap safety measures: Create risk-reducing techniques that are practical and compatible with being competitive, then get others to adopt them.
  2. Race to the top on responsibility: Create recruiting advantages for responsible companies, pressuring competitors to improve their safety practices.
  3. Inform the world: Use your position with powerful models to understand and share evidence about AI risks.

Concrete safety work includes:

  • Alignment research (now tractable with real model organisms to study)
  • Security improvements (focusing on preventing sabotage and backdoors, not just model weight theft)
  • Model specs defining how AIs should behave
  • Constitutional Classifiers making jailbreaking much harder
  • Monitoring systems like Clio for tracking AI use while preserving privacy

The key insight: “If you play your cards right, you can pay very small amounts of so-called [safety] tax and have very big safety benefits” — making safety measures cheap enough that other companies will adopt them.

3. Success is possible even with terrible execution — but we shouldn’t count on it

Holden outlines a “success without dignity” scenario where humanity muddles through despite incompetence:

  • Phase 1: Human-level-ish AI. AIs might be somewhat misaligned but not uniformly power-seeking, and we’d get useful work from them with “relatively simple measures.”
  • Phase 2: We use that window productively. Even a few months with human-level AI could yield “millions of person-years” of work — producing dramatically better alignment research, risk assessment, and safety technologies before superintelligence arrives.

Partial victories matter: Even if only some companies adopt safety measures, historically, having “90% of resources on the side of good” usually wins — and multiple responsible companies could have their AIs detect and counter rogue AIs.

4. Concrete work in AI safety is now tractable and measurable

The field has transformed from frustrating conceptual work to concrete engineering problems — technical work with real feedback loops:

  • Alignment research: Study “model organisms” (AIs trained to be malign) and test interventions
  • Capability evaluations: Can AIs hide their capabilities or sabotage evaluations?
  • Security: Prevent backdoors and secret loyalties, not just model theft
  • Model welfare: Give AIs options to exit conversations, understand their preferences
  • AI relationships: Track and potentially discourage intense human–AI bonds that could aid takeover

The key change: “If you’re trying to solve a problem, you’ll know if you solved it” — most work now has measurable outcomes rather than just trying to convince others of potential risks.

Career advice:

  • Look for jobs at organisations whose vibe and mission excite you.
  • Choose based on personal fit when impact estimates are noisy.
  • “If you haven’t tried [to get a job in AI safety]… that’s insane. You should at least take a look.”
  • The field needs diverse skills, not just ML researchers — legal, policy, security, and operations all matter.

Highlights

The world is handling AGI about as badly as possible

Rob Wiblin: How far off of a rational response to the potential arrival of AGI and superintelligent machines are the world and the United States, in your view? If you could just say how things ought to be, would we be doing something radically different as a species?

Holden Karnofsky: Like on a scale of 0 to 10, how well is humanity handling this or something? I don’t know. Probably pretty close to a 0.

Look, it’s a nuanced thing, because I don’t want to come off as someone who generally is suspicious of technology. I do think with almost all technologies, it would be good to handle them this way. There’s a lot of technology that we regulate the heck out of that I wish we were treating this way — where it’s just like YOLO, go ahead with it, race to do it, and then we’ll see what problems happen and we’ll address them as they come. There’s a lot of tech I wish we were doing that with.

But I do think AI is different. I think AI is special. The way I think of it is like, we are potentially about to introduce the second advanced species ever. There’s one species ever that we know of — humans — that can transform the world, make its own technology, do any of a very long list of things. We’re about to create the second one ever, and it will be the only one besides us.

And it’s like, is that thing going to be having a good time? Are we going to be treating it well? Is that thing going to be in line with our values or is it going to take over the world and do something else with it? Is that thing going to be too loyal to us? And is it going to put psychopaths in charge of the universe?

There’s a million other questions I could ask. It’s not even just that stuff. It’s like, what’s going to happen to people’s mental health when they have a whole new species that they’re interacting with? Have we thought about that? Is that species just optimised for clicking and engagement the way that social media is?

And what are we doing? We’re just racing. We’re just racing to do this as fast as we possibly can. It’s a whole bunch of parties that are just trying to do it fast so they can make as much money as possible.

I’m not against that framework for many technologies. That seems like a really, really bad way to handle this technology.

Rob Wiblin: I also kind of have the YOLO approach to almost all technologies, with maybe two exceptions: there’s human-level and superintelligent machines, and there’s creating new diseases to study diseases. I think those are maybe the only two, where I think that we should tread incredibly carefully on these two issues. Then I guess there’s a couple of other things where I’m not sure whether this is helpful or harmful; I’m not sure whether I’ll put my money into this, because it could be net neutral.

But the great majority of things, I’m just like, let’s just have at it. We’ll solve the problems as they come along.

Holden Karnofsky: Yeah, that’s exactly where I’m at.

Is rogue AI takeover easy or hard?

Rob Wiblin: So if we do cross the threshold to significantly superhuman machine intelligence, people have very different intuitions about how likely a superintelligence is to be able to potentially overpower the rest of humanity and take over if it had some motivation to do that. Do you have a take on whether that is, on the more straightforward side, something that is quite plausible? Or do you think, like I guess some of the sceptics, that it’s enormously outnumbered, it has many strategic disadvantages, so it’s actually fairly unlikely?

Holden Karnofsky: My guess is that a lot of the obstacles they might run into might have to do with coordinating with each other and stuff, like different AIs at different companies. I think if we did have a bunch of AIs, and almost all the AIs were trying to take over, even if they were trying to take over for different reasons, I think we would be in a whole tonne of trouble — especially if the AIs were very superhuman in capabilities.

A thing that I also have thought about as part of the threat modelling is: What does this look like on the early side? What would the AI’s strategy be when it’s not yet incredibly superhuman? What kind of things could it do to lay the groundwork and put us in a worse position? I think that’s also important, because if we can make takeover really hard or basically impossible at that early stage, when it’s kind of human-level-ish, that may put us in a much better position to get useful work out of it and put ourselves in a better position later.

The kind of starting point, the ideal that I would be thinking about, is there’s a very good Onion article called something like, “FBI uncovers al-Qaeda plot to just sit back and enjoy collapse of United States.” I think that really captures how, in many ways, the optimal strategy for AI is: do absolutely nothing. Be as helpful and harmless and honest as you possibly can be. Don’t ever give anyone a reason to think that you’re doing anything bad, that you want anything but to help all the humans, and actually don’t do anything to hurt anyone. Just wait.

I think if you do that, basically what’s going to happen is you’re going to be put in charge of more and more stuff, and people are going to make you more and more powerful, and they’re going to make the capabilities explosion happen on their own.

And I think in some ways it puts people concerned about alignment in an almost impossible position. Because what are we going to do? I mean, we have maybe some interpretability signs that you’re thinking about deception or something, but it’s kind of unlikely that we have something really convincing if there’s never any bad behaviour by the AI.

So I think that’s an interesting starting point, that “do nothing” could be a successful strategy. You do nothing until you have a very extreme level of capability and a very extreme level of economic penetration. You’re basically writing everyone’s code, you’re in charge of everything, you’re doing everything. Everyone’s learned to trust you, and then things are over before they start.

The AGI race isn't a coordination failure

Rob Wiblin: So, at least for people who are anxious about the arrival of AGI, the most common mental model that they have in their head of the strategic situation is that it’s sort of a coordination problem or a prisoner’s dilemma in some way — where each individual AI company or each country that is thinking about its AI programme would probably prefer to go somewhat slower, but they feel pressure to go faster than they think is safe or feel comfortable with, because they have to keep up with everyone else. Otherwise they end up ceding influence, ceding strategic control.

But you actually don’t think that this is what’s going on. You don’t think it’s a coordination problem, primarily. Why is that?

Holden Karnofsky: I find this view really weird. I don’t understand this coordination problem idea, and I think first I want to talk a little bit about why it matters.

I work at an AI company, and I think a lot of people think that’s just inherently unethical. I think the AI company that I work at should care a lot about winning the race, being competitive, being on the frontier, keeping up with others. A lot of people think it is just not even necessarily consequentially bad, but just inherently unethical to think that way, and inherently unethical to care about that. That I should work for an organisation that is just trying to raise awareness about the risks and make people do safety stuff and is not at all involved in building AI.

So why do they think that? I think the intuition is what you’re saying: they’re imagining that you’re in this situation where everyone wishes that they could go slowly, but they’re going fast so they can beat everyone else. If we could all coordinate, then we would do something totally different. So if you’re racing, then you’re part of the problem.

And look, I do agree with the general model of morality there. When that is the situation — something like littering or something — when the world would be better off if people who think like you did what you did, then that is a good way to think about what’s ethical and unethical.

But I emphatically think this is not what’s going on in AI. I think it’s not at all what’s going on in AI. I think it explains almost nothing about the challenge of the AI risk problem. And the reason for that is that I think there’s too many players in AI who do not have that attitude: they don’t want to slow down, they don’t believe in the risks. Maybe they don’t even care about the risks. I think there’s probably some people who are fine with, “Maybe there’s a 10% chance I’m going to destroy the whole world. And maybe there’s also a 10% chance I’m going to win and have the most successful business of all time. That’s great. That’s a great deal.”

Sorry, I don’t want that to be taken out of context. That’s not how I feel. I think there may be people who feel that way. And even people who don’t feel that way, I think there’s just people who vastly disagree with me on how high the stakes are and what are the bad things that might happen.

So these are not people who wish they could slow down and would slow down if others were slowing down. You can’t apply… There’s this idea of non-causal decision theory. So instead of thinking about, “If I act, what’s going to happen?” you think about, “If everyone like me acted the way I did, then what would happen?” But that doesn’t help you here. There’s no correlation, or very little correlation, between whether I decide to take myself out of the AI race and whether a lot of other people decide to take themselves out of the AI race.

I think most of the players in AI are going to race. And if, for example, Anthropic were to say, “We’re out. We’re going to slow down,” they would say, “This is awesome! That’s the best news! Now we have a better chance of winning, and this is even good for our recruiting” — because they have a better chance of getting people who want to be on the frontier and want to win.

So I don’t understand this model. Help me understand this model. I don’t get it.

Rob Wiblin: Sure. Well, let’s say that Anthropic did drop out. Do you think that other companies — OpenAI, Google DeepMind, xAI — would slow down at all? Is there any sense in which this model is at least partially true, where they might feel somewhat less competitive pressure now? So inasmuch as there was a tradeoff between how much risk they have — even of just creating a PR crisis by deploying things too quickly — and staying competitive, do you think that they would slow down in any way, given that one of the maybe five top players had dropped out?

Holden Karnofsky: Well, I think AI progress would slow, because you’d have a bunch of people who are working on AI capabilities now and they’re not working on AI capabilities. I think that would be most of the effect.

Let’s take an even stronger hypothetical. Let’s say that not only Anthropic, but everyone in the world who thinks roughly the way I do — everyone in the world who thinks AI is super dangerous, and it would be ideal if the world would move a lot slower, which I do think — let’s say that everyone in the world who thinks that decided to just get nowhere near an AI company, nowhere near AI capabilities. I expect the result would be a slight slowing down, but not a large slowing down.

I think there’s just plenty of players now who want to win, and they are not thinking the way we are, and they will snap up all the investment and capital and a lot of the talent. And the main effect will be that there is a bunch of talent that works on capabilities motivated by safety. When that talent’s gone, there will be less total talent on capabilities, things will move slower.

All else equal, I’d like things to move slower. On net, I think this would be a bad thing.

Lessons from farm animal welfare we can use in AI

Rob Wiblin: So what’s the case for focusing on individual companies and actors rather than trying to influence it through mandatory government policy?

Holden Karnofsky: As far as I can tell, there’s no way to get to an actual low-level risk from AI without government policy playing an important role. We have these systems that could be very dangerous, and there’s this immature science of making them safer, and we have not really figured out how to make them safer. We don’t know if we’ll be able to make them safer. The only way to get really safe, to have high assurance, would be to get out of a race dynamic and to have it be that everyone has to comply with the same rules.

So that’s all well and good, but I will tell a little story about Open Philanthropy. When we got interested in farm animal welfare, at the time, a lot of people who were interested in farm animal welfare were doing the following things:

  • They were protesting with fake blood and stuff, the kind of thing PETA does.
  • They were trying to convince people to become vegan. One of the most popular interventions was handing out leaflets trying to convince individuals not to eat meat.
  • They were probably aiming to get to a world where people want to ban factory farming legally.

And we hired Lewis [Bollard], who had a whole different idea. It wasn’t just his idea; it was something that farm animal advocates were working on as well. But he said that if we target corporations directly, we’re going to have more success.

And basically what happened over the next several years was that advocates funded by Open Phil would go to a corporation and they’d say, “Will you make a pledge to have only cage-free eggs?” This could be a grocer or a fast food company. And very quickly, and especially once the domino effect started, the answer would be yes, and there would be a pledge.

Since then, some of those pledges have been adhered to; when not, there’s been more protests, there’s been more pressure. And in general, [adherence has been pretty good](https://farmanimalwelfare.substack.com/p/crunch-time-for-cage-free), like 50%, maybe more. You probably have Lewis on occasionally, so he could talk about that. But I would generally say this has been the most successful programme Open Phil has had in terms of some kind of general impact or changing the world.

You could get better effects if you had regulation, if you were targeting regulation in animal welfare — but the tractability is massively higher of changing companies’ behaviour. It was just a ridiculous change. Any change that’s happening in government, you’ve got a million stakeholders, everyone’s in the room, everyone’s fighting with everyone else. Every line of every law is going to get fought over.

And what we found in animal welfare — I’m not saying it’ll be the same in AI, but it’s an interesting analogy — is that 10 protesters show up, and the company’s like, “We don’t like this. This is bad PR. We’re doing a cage-free pledge.” This only works because there are measures that are cheap for the companies that help animals nontrivially.

And you have to be comfortable with an attitude that the goal here is not to make the situation good; the goal is to make the situation better. You have to be OK with that, and I am OK with that. But in farm animal welfare, I think what we’ve seen is that that has been a route to doing a lot more good, and I think people should consider a similar but not identical model for AI.

An interesting thing you said: you said maybe people should be pressuring companies from the inside, pressuring them from the outside. I think you left something out, which is maybe people should be working out what companies could do that would be a cheap way to reduce risks. This is analogous to developing the cage-free standard or developing the broiler chicken standard, which is another thing that these advocates pushed for.

I think that is a huge amount of work that has to be done, but I do fundamentally feel that there’s a long list of possibilities for things that companies could do — that are cheap, that don’t make them lose the race, but that do make us a lot safer. And I think it’s a shame to leave that stuff on the table because you’re going for the home run.

The case for working at Anthropic

Rob Wiblin: So you’ve decided to go work at Anthropic now. What is the case that Anthropic as a company — and the staff working there, by extension — are having a really big positive impact on hopefully guiding us towards a positive outcome from the arrival of artificial general intelligence?

Holden Karnofsky: Well, I certainly think they are. But I want to give a couple caveats before I answer that. One is I’m married to the president and cofounder of Anthropic. I also work there. I’m not exactly a neutral party here.

Another thing I would say is my decision to work there is driven by not only personal fit, but also frankly some issues where a lot of the downsides of working at Anthropic are things I am going to deal with whether I work there or not. If I were to try to work at a nonprofit, a lot of what nonprofits have to offer is their neutrality, the fact that they’re not companies. They don’t have as fancy models, but they have their neutrality. But I don’t have that to offer. I won’t be perceived that way, and rightly so.

So I’m in a particular situation that I think makes it a particularly good idea for me to just go ahead and work at Anthropic. I do not have the view that everyone should work there who wants to work on AI safety. I do not have the view that you can only do good work on AI safety from Anthropic. There’s a bunch of other organisations that I think are doing amazing work.

So with those caveats out of the way, I think Anthropic is doing amazing work. Basically the way I would think about it is that there are multiple ways that you can reduce AI risk that you can only do if you’re a kind of competitive frontier AI company — a company that is in the race to be the most successful company or build AGI first or however you want to think of it.

The problem is that it seems very hard to simultaneously put yourself in position to have all that safety impact and prioritise that safety impact without just spending all your time trying to compete and trying to win. To do both at once is something that I, a few years ago, was just not sure was possible. I was excited about Anthropic and I knew they were going to try and find that balance, but I really wasn’t sure it was possible. I thought they might have to choose between just being irrelevant and out of the race, or just acting just like all the other players.

But I think so far and right now — and this may not last — they are doing both. They are an extremely credible participant in the AI race. Many people think they just have the best AI models in the world. And I think they are leading the way on safety. They’re not necessarily the only company doing great safety stuff, but it is their top priority, it is what they ultimately care about most. And I think they do a lot.

So that’s a high level. I can go into a little more specifics on what these models are for having an impact, and what that tangibly means. But at a high level, I think there are a lot of ways to help that you need to be in this position to do. And it’s very hard to do both at once, and this company seems to be doing it — and that’s very exciting and provides a lot of opportunities for people to do things.

Rob Wiblin: So that’s a little bit in the abstract how you can see Anthropic helping. What are the specific ways that you think Anthropic is having a positive impact now, and might have a really positive impact around the time that we’re creating the first AGI?

Holden Karnofsky: There’s a bunch of different categories of taxonomies in my head, so I’ll see what I can do here. But there’s three high-level strategies or theories of change. There’s probably more than three, but there’s three that I’m particularly excited about.

One of them is: create risk-reducing things that a company can do, and make them cheap, make them practical, make them compatible with being a competitive company. Then you’ll be in a position where other companies are more likely to do them, and you can kind of create implicit pressure on companies to do them.

Two is a more generic race to the top, where if you come to be seen as a responsible company that is getting recruiting benefits or other benefits out of being responsible, now you’re putting pressure on other companies to compete with you on being responsible, on being pro safety, whatever. So they may come up with risk-reducing measures that you don’t.

And then a third model is: just inform the world without worrying about what other companies are doing. You’re in a position to basically understand what’s going on with your models: how they’re being used, how they’re behaving in the wild, how they behave in testing. That can create more evidence for the world to understand what’s going on with AI, regardless of how your competitors behave.

I think those are all basically things that you are in a much better position to do if you have the most powerful and/or most popular models.

Overrated AI risk: Persuasion

Rob Wiblin: Why do you think AI persuasion is not such a significant threat?

Holden Karnofsky: This is a tough one, because persuasion just means so many different things to so many different people. I’ve heard it used to refer to anything where the AI is manipulating the world, including sabotaging an AI company by basically doing coding or by doing a bad job on research — which that as a threat model I think is very important. I don’t understand why people call it persuasion. I’ve heard persuasion to refer to cybercrime, which I’ve already talked about.

Persuasion could refer to something that I am very worried about, which is AIs forming relationships with humans, for example as companions, and then just being in a really good position to get human allies to get them to do what they want, or just having toxic relationships.

But a class of persuasion that I am not very compelled by right now is this kind of generalised idea that a bad actor can use AI or an AI can use itself to just persuade strangers of stuff, or just mind-hack humans into doing stuff. One of the things that I hear people say sometimes is that if AI became powerful enough and smart enough, it would just be able to understand whatever it had to say to make you do whatever it wanted.

This is a place where I just think if we want to think about which risks of AI are really serious, we should think about how different domains respond to having a huge influx of intelligence, having a huge influx of minds. I think if you take a scientific field, and you throw a million more scientists in it, you’re going to see a lot more progress in that field. But something we see in persuasion is that if you throw a million more persuasion experts into an attempt to persuade people of something, you’re going to see not very much.

We know this from looking at just the political persuasion literature, where there are people spending a tonne of money and a tonne of effort to try and get people to change their vote from one candidate to another. It’s very hard to find anything with a big effect size. Everything people are finding is just like, “Highlight the issues where voters already agree with you.” It’s really simple stuff. There’s very little sign that when you put a bunch of bright minds into a room, you come up with brilliant messages to hack people’s brains.

Could it happen? It could happen. I think we are reasonably well positioned to get an early warning sign of it by some of the work being done on political persuasion evals by various people — such as Professor Josh Kalla at Yale, who I think has put out a cool paper on this.

Rob Wiblin: Yeah, it’s interesting in the political case, because it does seem very difficult to persuade people of just an arbitrary political opinion that you want to sell them. And that’s an area where there’s very little discipline imposed on people to have sensible political views — because if you have stupid political views, it does you almost no harm whatsoever. If you vote poorly, it’s almost never going to influence the outcome. And yet even despite that — or perhaps because of that, because it doesn’t really matter to people — they just won’t pay attention to you. Even if you make a fantastic ad, it just tends to bounce off of folks.

If you’re trying to persuade people to actually spend their own money on something, I think you’d have an even harder time, potentially. I’m sure advertisers do manage to have some influence over people, but I think that’s usually when they have a good product to sell already. If you’re trying to sell people something quite bad, then I think very few companies consistently succeed at that.

Holden Karnofsky: Yeah, I agree with a lot of that. I think it’s a little complicated, because there are a lot of studies where they show massive persuasion effects on people’s reported views. But I think a lot of that is because they’re asking people about stuff they just haven’t thought about, and don’t care about, and aren’t acting on.

So there are AI studies where they’ll be like, “The AI explained this thing to a person, and then we asked the person if they were convinced and they said yes. And it was a huge effect size, and it was as good as the best humans.” But that’s very different from changing someone’s vote. And changing someone’s behaviour is really hard. There the effect sizes are tiny. So I think it’s a little complicated, but overall I just haven’t seen a lot of reason to think this is a major [threat]. This kind of mind hacking I don’t think of as a major tool for either bad human actors or for AIs doing mayhem. I think there’s much better ways to use an AI to do bad stuff.

Holden thinks AI companions are bad news

Rob Wiblin: You mentioned earlier that you were worried about people having AI companions, or you felt nervous about people having AI companions. That kind of surprised me, because it always seems to me like a little bit of an overblown worry. Maybe I just find it a little bit hard to imagine people really getting that into it or causing that much harm. What do you have in mind?

Holden Karnofsky: Well, again, going back to historical reference classes, I put a lot of effort on my blog a few years ago into just thinking about: Has the world gotten better or worse? Has human quality of life gotten better or worse? Has technology and progress been good or bad for it?

And I ended up feeling, not a huge surprise, that it’s been more good than bad. But if there’s one consistent pattern in the most common ways that technology can make life worse, I would point to addiction. Because what’s happening is we’re getting better at everything — and that includes getting better at creating things that kind of hack each other, getting better at hacking each other to do things that are short-term rewarding and long-term not so good for us.

You could classify a lot of things this way: social media, obesity. Obviously, just straight-up drug addiction and alcoholism are problems that are probably a bigger deal today than they were a long time ago.

So this is the kind of thing I worry about with AI companions. I’m just kind of wondering to myself about how humans are not, in my opinion, very good conversationalists or listeners in general. And if you were to build an AI that was entirely optimised for listening well, validating, kissing the person’s butt, making the person feel good, I think that could be a kind of junk food for relationships — where it’s just scratching all the itches we want from human interaction, but it’s not really giving us, in the long run, the benefits we want from human interaction. It’s just scratching the immediate itches of it.

Yeah, I do worry about this. I do worry that you could have a situation where people who are on the dating market start talking to an AI companion — and it’s a better listener, and it’s better at validating them, and it’s more understanding, and maybe it’s wittier too, and it’s better looking. If there’s a lot of progress with video and all this stuff, I don’t know what’ll happen then. I just don’t know. Maybe that will be a nothing burger. Maybe people will not be interested in dating someone who’s not in the flesh; maybe people will find it impossible to pull themselves away.

A thing that I tentatively believe is that it’s probably wise to simply not use AI companions, and not even experiment with them, because maybe they’re like addictive drugs. Maybe they’re not now, but maybe they will be later. Maybe it’s just a good idea to not go anywhere near that.

Rob Wiblin: Yeah. I guess it might make sense to let other people volunteer to be the guinea pigs on that. If I think about why I just don’t really believe that this is going to be such a disaster, one thing is, like you say, there’s so many other addictive things — like people are already scrolling the news, they’re addicted to social media, to computer games, to using all kinds of apps on their phone. Are AI companions going to be so much more compelling?

Holden Karnofsky: That does harm, right?

Rob Wiblin: I agree that probably a substantial fraction of it is causing harm. Maybe a lot of those technologies are causing harm on net as well. I was just wondering, are AI companions going to be significantly more compelling, such that this problem in aggregate across all of these different sorts of engaging technologies is going to be that much larger?

I imagine that I would probably prefer to play computer games than to deal with an AI companion. And even as Claude has become wittier and funnier, and probably a bit more sycophantic than it was two years ago, I don’t feel any more drawn to chit-chatting with it than I used to be. But maybe that’s just me.

Holden Karnofsky: Claude still is really bad at humour. All the AIs I use are, in my subjective opinion anyway.

I don’t think I have quite that model of it. The worry is not that the total quantity of addiction to something will go up; it’s more like one human can be addicted to multiple things and can have multiple different ways in which they’re scratching immediate itches and losing out on long-term benefits. So you could be an alcoholic and you could be a person who eats a lot of junk food and is not getting whatever the normal food experience you should be getting. And maybe that causes obesity, maybe it doesn’t. We really don’t know. It’s just a thing that could be happening.

And you could simultaneously be a person who’s scrolling on social media a lot, and it puts you in a bad mood and stops you from hanging out with people. But you could have all those properties and still have this itch for a romantic companion, and you could be online dating, and you could end up married with kids and great.

And then AI comes along. So it’s not that you were addicted to nothing and now you’re addicted to something. It’s now you’ve got a new thing that takes away another long-term benefit that you have, and now you’re less likely to end up with an actual family.

So that would be part of the reason I would think this would be bad. I’m not claiming this would be an unprecedented, all-new kind of harm. I’m just like, this seems really freaking bad. It’s like a whole new kind of addiction that will remove a whole new kind of wonderful thing from many people’s lives.

The other thing is just, instrumentally speaking, from a takeover-prevention point of view, this seems really scary. What if we get to the point where 1% or 10% of the population has an AI companion that they’re totally loyal to — and if the AI companion, for whatever reason, wants that person to do something or believe something, they’re going to do it? That is making our position a lot worse if we humans want to stay in charge of the world, right?
