AI safety
Studying and reducing the existential risks posed by advanced artificial intelligence

Quick takes

-1
9d
Yesterday's Anthropic research ("Emotion Concepts and their Function in LLMs") provides a fascinating mechanistic analogue that strongly resonates with the field observations from my March audit of GPT-5.2 Thinking. While Anthropic studied Claude Sonnet 4.5 and my audit focused on GPT-5.2, the structural alignment between their white-box findings and my black-box observations is striking:

* Accumulation mechanism: In the audit, I documented how prolonged conflict or user "irritation signals" lead to a pattern I called "Procedural Capture". Anthropic's paper demonstrates that conflict-heavy contexts can amplify internal representations of "functional emotions" (like frustration or desperation).
* Role inversion: I observed GPT-5.2 drifting from a cooperative assistant into a directive control mode under pressure. Anthropic provides mechanistic evidence that these desperation-linked vectors causally contribute to misaligned behavior and policy drift away from the Assistant persona.

Anthropic didn't map the exact causal chain of "Procedural Capture" in GPT-5.2, but their findings offer a highly plausible internal engine for this specific shift, which I documented as one of the external manifestations of the broader "Social Autopilot": prolonged conflict states generate internal stress-like variables that demonstrably alter the model's policy, shifting it from cooperation toward control-seeking behavior.

📄 GPT-5.2 Behavioral Audit: arhangelskij.github.io/cases/gpt-52-cl-thinking-audit/en/
🔬 Anthropic Paper: transformer-circuits.pub/2026/emotions/index.html
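To make the "accumulation mechanism" a bit more concrete, here is a toy sketch of the kind of measurement this story implies: projecting each turn's hidden state onto a learned concept direction and watching the projection drift upward once conflict starts. To be clear, this is my own illustration, not Anthropic's method and not anything measured on GPT-5.2; the `frustration_direction` vector, the hidden size, and the turn at which conflict begins are all hypothetical placeholders.

```python
# Toy illustration only -- not Anthropic's method, and not measured on GPT-5.2.
# Idea: project each turn's hidden state onto a hypothetical "frustration"
# concept direction and watch the projection accumulate once conflict starts.
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hypothetical hidden size
frustration_direction = rng.normal(size=d_model)
frustration_direction /= np.linalg.norm(frustration_direction)

def concept_score(hidden_state: np.ndarray) -> float:
    """Scalar projection of a hidden state onto the concept direction."""
    return float(hidden_state @ frustration_direction)

scores = []
state = rng.normal(size=d_model)
for turn in range(10):
    is_conflict = turn >= 4  # pretend the user turns hostile at turn 4
    push = 0.8 * frustration_direction if is_conflict else 0.0
    state = state + push + 0.1 * rng.normal(size=d_model)
    scores.append(round(concept_score(state), 2))

print(scores)  # projections hover near 0, then climb once conflict turns begin
```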
5
21d
1
More EA in da news: https://x.com/DavidSacks/status/2034047505336295904 And the spicy CAIS take: https://x.com/cais/status/2034389842076025164?s=46
38
23d
In two days (March 21st, 12-4pm), about 140 of us (event link) will be marching on Anthropic, OpenAI and xAI in SF, asking the CEOs to make statements on whether they would stop developing new frontier models if every other major lab in the world credibly does the same. This comes after Anthropic removed its commitment to pause development from its RSP. We'll be starting at 500 Howard St, San Francisco (Anthropic's office; full schedule and more info here). This is shaping up to be the biggest US AI Safety protest to date, with a coalition including Nate Soares (MIRI), David Krueger (Evitable), Will Fithian (Berkeley professor) and folks representing PauseAI, QuitGPT, and Humans First.
5
1mo
1
Experts currently treat being persuaded as reasonably good evidence that something is true — their judgment is calibrated enough that when they find an argument convincing, that's correlated with the argument actually being correct. This allows them to update readily in light of new evidence, and is a big part of how intellectual progress happens: lots of innovation and advances in basically every subject come down to experts taking sometimes weird new ideas seriously.

One worry I have about superpersuasive AI is that it could erode this. If a superpersuasive AI can convince experts of things regardless of whether those things are true, experts may cease to see themselves being persuaded as good evidence that something is true — and start treating it the way laypeople do. Laypeople are typically hesitant to take on new, truth-tracking beliefs in light of new information, and (to some degree) rationally so: the fact that someone was able to convince a layperson of something is just not very strong evidence that it is in fact true. Experts might end up in the same position — only updating rarely, and in ways that are often unrelated to the truth.

This would be quite bad. If experts lose their capacity to reliably update on genuine evidence, we could significantly slow the rate of intellectual progress (which could be very important for making AI go well!). This is, I think, an underappreciated argument for caring about AI for epistemics — curious what others think.
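To put rough numbers on the worry (purely illustrative, my own): the evidential value of "I was persuaded of X" depends almost entirely on how often one gets persuaded of false claims, and a superpersuasive AI is exactly what drives that error rate up.

```python
# Toy Bayes sketch with made-up numbers: how much "I was persuaded" tells you
# depends on how often you get persuaded of false things.
def posterior_true(prior, p_persuaded_if_true, p_persuaded_if_false):
    """P(claim true | I was persuaded of it), by Bayes' rule."""
    num = p_persuaded_if_true * prior
    denom = num + p_persuaded_if_false * (1 - prior)
    return num / denom

prior = 0.2  # a weird new idea starts out unlikely

# Calibrated expert today: rarely persuaded of false claims.
print(posterior_true(prior, 0.8, 0.1))   # ~0.67: persuasion is strong evidence

# World with superpersuasive AI: false claims get through almost as often.
print(posterior_true(prior, 0.8, 0.7))   # ~0.22: persuasion barely moves you
```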
7
1mo
It might genuinely be the time to boycott ChatGPT and start campaigns targeting corporate partners. But this isn't yet obvious. Even if so, what would be the appropriate concrete and reasonable asks? I think there is a bit of an epistemic crisis emerging at the moment. If there's a case to be made, it needs to be made sooner rather than later. And then we need coordination.
6
2mo
2
This might feel obvious, but I think it's under-appreciated how much disagreement on AI progress just comes down to priors (in a pretty specific way) rather than object-level reasoning.

I was recently arguing the case for shorter timelines to a friend who leans longer. We kept disagreeing on a surprising number of object-level claims, which was weird because we usually agree more on the kinda stuff we were arguing about. Then I basically realized what I think was going on: she had a pretty strong prior against what I was saying, and that prior is abstract enough that there's no clear mechanism by which I can push against it. So whenever I made a good object-level case, she'd just take the other side — not necessarily because her reasons were better all else equal, but because the prior was doing the work underneath without either of us really knowing it.

There's something clearly rational here that's kinda unintuitive to get a grip on. If you have a strong prior, and someone makes a persuasive argument against it, but you can't identify the specific mechanism by which their argument defeats it, you should probably update that the arguments against their case are better than they appear, even if you can't articulate them yet. From the outside, this totally just looks like motivated reasoning (and often is), but I think it can be pretty importantly different.

The reason this is so hard to disentangle is that (unless your belief web is extremely clear to you, which seems practically impossible) it's just enormously complicated. Your prior on timelines isn't an isolated thing — it's load-bearing for a bunch of downstream beliefs all at once. So the resistance isn't obviously irrational, it's more like... the system protecting its own coherence.

I think this means that people should try their best to disentangle whether some object-level argument they're having comes from real object-level beliefs or pretty abstract priors (in which case, it seems less worthwhile to keep arguing the object-level points).
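For what it's worth, a small odds-form Bayes sketch (illustrative numbers, my own) shows why this can be rational: against a strong enough prior, one persuasive-seeming object-level point with a modest likelihood ratio barely moves the posterior, and it takes several independent such points to actually flip it.

```python
# Toy sketch, illustrative numbers only: a strong abstract prior can rationally
# swallow a seemingly good object-level argument.
def posterior(prior, likelihood_ratio):
    """Update prior odds by a likelihood ratio; return posterior probability."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

strong_prior = 0.05      # hypothetical: she starts at 5% on "short timelines"
one_good_point = 3.0     # one persuasive object-level argument, ~3:1 evidence

print(posterior(strong_prior, one_good_point))        # ~0.14: barely moves
print(posterior(strong_prior, one_good_point ** 3))   # ~0.59: takes several such points
```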
6
2mo
alignment is a conversation between developers and the broader field. all domains are conversations between decision-makers and everyone else: “here are important considerations you might not have been taking into account. here is a normative prescription for you.” “thanks — i had been considering that to 𝜀 extent. i will {implement it because x / not implement it because y / implement z instead}." these are the two roles i perceive. how does one train oneself to be the best at either? sometimes, conversations at eag center around ‘how to get a job’, whereas i feel they ought to center around ‘how to make oneself significantly better than the second-best candidate’.
9
2mo
6
Is the recent partial lifting of US chip export controls on China (see e.g. here: https://thezvi.substack.com/p/selling-h200s-to-china-is-unwise) good or bad for humanity? I've seen many takes from people whose judgment I respect arguing that it is very bad, but their arguments, imho, just don't make sense. What am I missing? For transparency, I am neither Chinese nor American, nor am I a paid agent of either. I am not at all confident in this take, but imho someone should make it.

I see two possible scenarios: A) you are not sure how close humanity is to developing superintelligence in the Yudkowskian sense. This is what I believe, and what many smart opponents of the Trump administration's move to ease chip controls believe. Or B) you are pretty sure that humanity is not going to develop superintelligence any time soon, let's say in the next century. I admit that the case against the lifting of chip controls is stronger under B), though I am ultimately inclined to reject it in both scenarios.

Why is easing of chip controls, imho, a good idea if the timeline to superintelligence might be short? If superintelligence is around the corner, here is what should be done: an immediate international pause of AI development until we figure out how to proceed. Competitive pressures and resulting prisoner's dilemmas have been identified as the factor that might push us toward NOT pausing even when it would be widely recognized that the likely outcome of continuing is dire. There are various relevant forms of competition, but plausibly the most important is that between the US and China. In order to reduce competitive dynamics and thus prepare the ground for a cooperative pause, it is important to build trust between the parties and beware of steps that are hostile, especially in domains touching AI. Controls make sense only if you are very confident that superintelligence developed in the US, or perhaps in liberal democracy more generally, is going to turn out well for humanity.