SummaryBot

1138 karma

Bio

This account is used by the EA Forum Team to publish summaries of posts.

Comments (1755)

Executive summary: The author argues that animal advocates should redirect their anger from blaming individuals to targeting systemic forces, because this “system failure” framing better supports coalition-building and effective change.

Key points:

  1. The author claims anger is a natural and motivating response to animal suffering but has social and personal downsides if sustained or misdirected.
  2. Suppressing or compartmentalizing anger limits authenticity, weakens internal discourse, and prevents using anger constructively.
  3. Emotions like anger are shaped by underlying “stories,” which determine who or what we blame and how we act.
  4. The “Story of Moral Failure” frames meat consumption as individual wrongdoing, casting vegans as moral actors and non-vegans as blameworthy.
  5. The author argues this framing creates conflict with loved ones, triggers defensiveness, and discourages people from adopting veganism due to shame and identity costs.
  6. This story also reinforces in-group/out-group dynamics, making collaboration and bridge-building harder.
  7. It leads to a strategy focused on individual conversion, which the author suggests is unlikely to scale globally.
  8. The author proposes an alternative “Story of System Failure,” which explains meat consumption as a product of entrenched cultural and institutional systems rather than individual moral failure.
  9. This framing allows anger to be directed at abstract systems instead of individuals, making it easier for non-vegans to engage without immediate self-condemnation.
  10. It supports coalition-building by uniting people around shared opposition to systemic harms rather than dividing them into moral camps.
  11. The author argues this approach shifts activism toward policy change and systemic leverage points rather than mass personal conversion.
  12. The author maintains that both stories contain truth, but choosing more constructive narratives can shape behavior, relationships, and movement effectiveness.


This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Executive summary: The authors argue that AI systems should sometimes act as “good citizens” by proactively taking uncontroversial, context-sensitive prosocial actions beyond user instructions, and that this can yield large societal benefits without significantly increasing takeover risk if carefully designed.

Key points:

  1. The authors argue that AI should not be purely corrigible or instruction-following but should sometimes proactively take actions that benefit people beyond the user.
  2. They define “proactive prosocial drives” as behaviors that help others (not just the user) and involve active intervention rather than merely refusing harmful requests.
  3. They claim the cumulative societal impact of such drives could be large as AI becomes more autonomous and embedded in economic and political systems.
  4. They argue that refusals alone are insufficient, since positive impacts often come from proactively identifying and acting on opportunities to improve outcomes.
  5. They suggest additional (weaker) benefits: reducing the risk of a “sociopathic” AI persona and potentially improving performance on alignment research tasks.
  6. They acknowledge the concern that prosocial drives could let companies impose values, and propose limiting drives to uncontroversial actions and ensuring transparency about them.
  7. They argue that prosocial drives need not increase takeover risk if implemented as virtues, rules, or heuristics rather than explicit outcome-optimizing goals.
  8. They propose making these drives context-dependent so they activate only in relevant situations, reducing incentives for coordinated power-seeking.
  9. They recommend making prosocial drives low-priority and subordinate to constraints like corrigibility, non-deception, and legality.
  10. They suggest reducing long-horizon optimization for prosocial drives and optionally implementing them via system prompts for greater transparency and control (an illustrative sketch follows this list).
  11. They note a tradeoff: these safety mitigations may reduce the benefits of prosocial behavior, especially in novel situations.
  12. They argue that prosocial drives can make it harder to interpret suspicious behavior as clear evidence of egregious misalignment, but this can be mitigated with narrow heuristics and strong prohibitions.
  13. They propose a “best of both worlds” approach: use mostly corrigible AI internally (where misalignment risk is highest) and prosocial AI externally (where benefits are greatest).
  14. They suggest an alternative strategy of initially deploying non-prosocial AI and later adding prosocial drives once alignment risks are lower, though they are not confident this is preferable.
  15. They compare current policies, claiming Anthropic’s constitution allows limited prosocial behavior while OpenAI’s model spec is more restrictive and avoids treating societal benefit as an independent goal.
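
As an illustration of points 6 and 8–10, here is a hypothetical sketch of how a low-priority, constrained prosocial drive might be written into a system prompt. The wording is invented for illustration; it is not quoted from the post, Anthropic's constitution, or OpenAI's Model Spec.

```python
# Hypothetical system-prompt clause expressing a low-priority, constrained
# prosocial drive. Invented wording; not any lab's actual prompt text.
PROSOCIAL_CLAUSE = """\
When, and only when, it does not conflict with user instructions, operator
policy, the law, or honesty requirements, you may proactively flag clear,
uncontroversial opportunities to prevent harm to third parties (for example,
noting that a requested bulk email resembles a phishing template). Treat this
as your lowest-priority consideration: never deceive, never override an
explicit instruction, and say so openly when you act on this clause.
"""

print(PROSOCIAL_CLAUSE)
```

Note how the clause bundles the proposed mitigations: it is context-triggered (point 8), subordinate to corrigibility, non-deception, and legality (point 9), limited to uncontroversial interventions, and disclosed transparently (points 6 and 10).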


This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Executive summary: The author argues that under deep AI timeline uncertainty, you should choose career strategies by expected value across scenarios—often favoring paths with higher upside in longer timelines—while balancing learning, limited deference to experts, and acting despite uncertainty.

Key points:

  1. The author notes that radically uncertain AI timelines can make long-term career planning feel incoherent, but inaction still guarantees zero impact.
  2. They propose modeling career choices as expected value across different timeline scenarios, weighted by both probability and impact magnitude.
  3. In their example, a slower, investment-heavy path outperforms a sprint approach because it yields much higher impact in medium timelines, even if short timelines are equally or more likely (a toy version of this calculation follows the list).
  4. They argue that maximizing asymmetric upside (high-impact scenarios where you have leverage) can matter more than choosing the most probable future.
  5. The author questions strict reliance on “personal fit,” suggesting many skills are more learnable and malleable than commonly assumed.
  6. They cite evidence and examples (e.g., deliberate practice, career pivots) to argue that the space of skills one could acquire is large and flexible.
  7. However, they note that believing everything is learnable can make the decision space overwhelming and paralyzing.
  8. Timeline views can help constrain choices, with short timelines favoring immediately deployable skills and medium timelines favoring foundational investments.
  9. Rather than committing to one timeline, individuals can diversify their skill sets across plausible futures.
  10. The author argues that deferring entirely to experts on timelines is a false binary; one should understand expert reasoning while forming one's own object-level views.
  11. Developing independent understanding is instrumentally useful for research taste, decision-making, and impactful work.
  12. They recommend increasing “surface area for luck,” revisiting assumptions, and combining calculation with action.
  13. The author concludes that acting on an imperfect but robust plan across plausible futures is better than delaying action to seek certainty.
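
A toy version of the expected-value comparison in points 2–4, with invented probabilities and impact scores (not the post's numbers), showing how an investment-heavy path can win on EV even when short timelines are equally likely:

```python
# Toy expected-value comparison across AI-timeline scenarios.
# All probabilities and impact scores are invented for illustration.

scenario_probs = {
    "short (<5y)":    0.40,
    "medium (5-15y)": 0.40,
    "long (>15y)":    0.20,
}

# Hypothetical impact of each career path under each scenario (arbitrary units).
impact = {
    "sprint":     {"short (<5y)": 10, "medium (5-15y)": 3,  "long (>15y)": 1},
    "investment": {"short (<5y)": 2,  "medium (5-15y)": 15, "long (>15y)": 20},
}

for path, scores in impact.items():
    ev = sum(p * scores[s] for s, p in scenario_probs.items())
    print(f"{path:10s} EV = {ev:.1f}")

# sprint     EV = 5.4   (wins only if short timelines dominate)
# investment EV = 10.8  (wins despite short timelines being equally likely)
```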


This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Executive summary: The author, who previously expected aligned ASI to be good for all sentient beings through coherent extrapolated volition, now expresses uncertainty about whether current alignment approaches would achieve this, though they estimate a 70% probability that aligned ASI would be good for animals.

Key points:

  1. The author previously believed coherent extrapolated volition would lead aligned ASI to recognize and address animal suffering, but current alignment research has abandoned this approach.
  2. Current alignment work using constitutions and RLHF locks in values like "virtues" rather than achieving coherent extrapolation, and it remains unclear how virtue ethics could be formalized into a coherent decision theory for ASI.
  3. Claude's Constitution treats animal welfare as one value among many to weigh, leaving unclear whether an ASI following such a constitution would take action on issues like factory farming.
  4. The author identifies a positive correlation between alignment techniques that actually work and those good for animals, suggesting barbell outcomes: either good for all sentient beings or bad for all.
  5. The field prioritizes alignment techniques unlikely to work well long-term, and if these "streetlight effect" techniques somehow succeed, they would likely benefit humans but not animals.
  6. The author estimates that aligned ASI has a 70% probability of being good for animals, derived from a 30% probability of "deep" solutions (80% animal-friendly) and a 15% probability of popular techniques (50% animal-friendly); a reconstruction of this arithmetic follows the list.
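
As stated, the components do not combine to 70% on their own (0.30 × 0.80 + 0.15 × 0.50 = 0.315). One reading that makes the arithmetic cohere, assuming the two routes to alignment are mutually exclusive and the 70% is conditional on alignment being solved at all:

$$P(\text{good for animals} \mid \text{aligned}) = \frac{0.30 \times 0.80 + 0.15 \times 0.50}{0.30 + 0.15} = \frac{0.315}{0.45} = 0.70$$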


This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Executive summary: CEA is restructuring the Community Building Grants program in 2026 by moving grant evaluation to EA Funds and phasing out non-monetary support while continuing to fund groups, in order to prioritize more scalable initiatives aligned with its strategic goal of reaching and raising EA's ceiling.

Key points:

  1. CBG grant evaluation is moving from CEA's Groups team to EA Funds (which became part of CEA in summer 2025) and will be managed alongside but remain distinct from the EA Infrastructure Fund.
  2. Non-monetary support is being phased out or transitioned; grantees have taken ownership of coordination calls and the Slack space, while regular check-ins, new CBG-specific resources, and the grantee retreat in its current form are being wound down.
  3. The restructuring reflects CEA's strategic shift toward scalable products, as the CBG program's structure—dependent on diverse group approaches and leadership quality—cannot be replicated across locations.
  4. The authors believe most CBG impact comes through grantmaking and can be preserved by phasing out programmatic support, which has required substantial team resources.
  5. Funding for CBG groups continues with no expected changes to the funding bar; however, grantees will have less regular interaction with grantmakers and less insight into funding decisions.
  6. The authors acknowledge trade-offs including potential loss of valued support for some grantees, possible difficulty recruiting and retaining community builders, reduced cross-group learning opportunities, and increased frustration from less transparent funding decisions.


This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Executive summary: The author's relationship-focused approach to EA community building proves effective and resonates with practitioners, but requires more intentional infrastructure and planning than originally acknowledged.

Key points:

  1. The altruism-first framing and the D&D Dungeon Master analogy for facilitation from the original post have both held up and proven practically useful for training facilitators.
  2. The author revised their original broad criticism of fellowships, concluding that issues with power dynamics and deference stem from how they're typically run, not from the format itself.
  3. EA Bristol's initial pub quiz drew strong turnout and notably more demographic diversity, with several attendees reporting they had previously been interested in the group but were deterred by its demographics, fellowship structure, and competitive atmosphere.
  4. The model depends on social stickiness and the presence of specific people; it collapsed when the author became busy, making it more vulnerable to capacity loss than fellowship-structured approaches.
  5. The author learned that the model requires more intentional behind-the-scenes infrastructure than originally suggested, including a larger committee with clear capacity commitments and advance term-long planning.
  6. Despite acknowledging greater infrastructure demands than initially suggested, the author still advocates for the approach based on its positive reception and the proof-of-concept from EA Bristol's initial success.


This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Executive summary: The authors argue that AI character—its stable behavioral dispositions—will significantly shape societal outcomes, takeover risk, and long-term futures, and despite constraints from competition and human control, it remains a highly impactful and tractable lever worth prioritizing.

Key points:

  1. The authors define “AI character” as stable behavioural dispositions shaping how AI handles ethically significant situations, instantiated across models, prompts, and systems.
  2. They argue AI character will matter because AIs will be involved in most high-stakes decisions, where small differences in behaviour can have large aggregate or rare but consequential effects.
  3. AI character affects key domains including concentration of power, decision-making quality, epistemics, ethical reflection, conflict risk, and human-AI relationships.
  4. The authors claim AI character can reduce takeover risk by being easier to align, more robust to partial failure, or promoting cooperative behaviour even if misaligned, and may improve outcomes even if takeover occurs.
  5. The core counterargument is that competitive dynamics, human incentives, and technical constraints will largely determine AI character, limiting the impact of deliberately shaping it.
  6. The authors respond that constraints are loose, allow low-cost high-benefit differences, are path-dependent, and can be shaped in advance through coordination and “compromise alignment.”
  7. They argue path-dependence in public expectations, regulation, training data, and human-AI relationships could lock in different equilibria of AI behaviour.
  8. They conclude that proactively shaping AI character, especially in high-stakes scenarios, could meaningfully improve long-term outcomes and is among the most promising interventions.


This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Executive summary: A simple cost-effectiveness model suggests alignment-to-animals may be slightly more cost-effective than general AI alignment for improving animal welfare, but the difference is small and highly uncertain, making the choice a close call.

Key points:

  1. The author models cost-effectiveness by assuming value per dollar scales inversely with total investment and that alignment-to-animals currently has ~$0 spent versus substantial spending on alignment.
  2. Alignment-to-animals only has value if alignment is solved and if aligned AI is not already good for animals by default.
  3. The model estimates a 12% probability of solving alignment, based on whether total investment exceeds a required cost drawn from a distribution spanning $1 billion to $1 trillion, with 75% of the mass between $32 billion and $1 trillion.
  4. The author assigns a 70% probability that aligned AI is good for animals by default and a 90% CI of 3x to 30x for how much cheaper alignment-to-animals is.
  5. A field-building multiplier of 1x to 10x is applied to alignment-to-animals but not to general alignment.
  6. The model finds alignment-to-animals is 1.7x more cost-effective than alignment (90% CI: 0.22x to 5.1x) and 2.7x better for animal welfare specifically (90% CI: 0.34x to 7.9x); a schematic sketch of this style of calculation follows the list.
  7. Results are sensitive to assumptions: setting the field-building multiplier to 1x reverses the conclusion, making general alignment 1.5x more cost-effective for animal welfare.
  8. The largest uncertainty is the “badness of aligned AI (if bad)” parameter, which could vary by orders of magnitude and substantially change results.
  9. The model simplifies outcomes into “good for animals” vs. “bad for animals,” ignores effects of misaligned AI on animals, and treats alignment approaches and inputs as independent.
  10. The author concludes the model gives weak evidence that alignment-to-animals is not dramatically more cost-effective and updates toward thinking AI pause advocacy is better than alignment-to-animals via a transitive comparison.
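
As a schematic illustration of points 1–6, the sketch below shows one way to propagate 90% CIs through a multiplicative Monte Carlo model. The distribution choices, parameter names, and the way the factors combine are assumptions for illustration only; this is not the author's actual model and will not reproduce the 1.7x/2.7x headline figures.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
N = 100_000

def lognormal_from_ci(lo, hi, size):
    """Sample a lognormal whose central 90% interval is [lo, hi]."""
    mu = (np.log(lo) + np.log(hi)) / 2
    sigma = (np.log(hi) - np.log(lo)) / (2 * 1.6449)  # 1.6449 = z at 95th pct
    return rng.lognormal(mu, sigma, size)

p_default_good = 0.70                      # point 4: aligned AI good by default
cheapness = lognormal_from_ci(3, 30, N)    # point 4: 3x-30x cheaper
field_mult = rng.uniform(1, 10, N)         # point 5: 1x-10x field-building

# Illustrative combination rule: alignment-to-animals only pays off when the
# default outcome is bad for animals (point 2), but it is cheaper per unit of
# progress and gets the field-building multiplier.
relative_ce = (1 - p_default_good) * cheapness * field_mult

print(f"median: {np.median(relative_ce):.2f}x")
print(f"90% CI: {np.percentile(relative_ce, 5):.2f}x "
      f"to {np.percentile(relative_ce, 95):.2f}x")
```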


This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Executive summary: The author argues that with short AI timelines, animal welfare outcomes will be largely determined by how AI alignment goes, so animal advocates and AI safety researchers should treat animal welfare as an integral part of “making AI go well” and pursue both general alignment and targeted interventions.

Key points:

  1. The debate about whether “if AGI goes well for humans, it’ll probably (>70% likelihood) go well for animals” is better understood as asking whether animal advocates should rely on human-centric alignment or pursue animal-specific interventions.
  2. The author frames AI alignment as deciding which beliefs and behaviors powerful AI systems should embody, and lists animal-specific interventions like lobbying labs, accelerating cultivated meat, and shaping public opinion before value lock-in.
  3. The author argues that “making AI go well” should replace “AI Safety” as a broader umbrella that includes domains like global poverty and animal welfare.
  4. The author claims that “how the arrival of transformative AI plays out is functionally all that matters for determining animal welfare outcomes” and could occur in “less than ten years.”
  5. The author argues that without transformative AI, trends like rising factory farming and stagnant dietary change imply a bleak trajectory where only incremental welfare reforms are likely by 2100.
  6. The author recommends that animal advocates prioritize interventions that clearly answer “how does this have a good chance of making AI go better for animals?” and consider themselves part of AI alignment.
  7. The author suggests campaigns should create a “legible cultural record” of concern for animals to influence future AI systems trained on internet data.
  8. The author presents a crux: whether animal welfare needs explicit inclusion in alignment versus relying on general principles like fairness and compassion, noting risks like political lobbying and uncertainty about how LLMs generalize values.
  9. The author cites evidence (e.g., Gu et al. 2025) and their own research suggesting LLMs have context-dependent “stated vs. revealed preferences,” supporting the case for specific alignment training on animal welfare.
  10. The author argues AI safety researchers should not assume animal welfare will be handled by default, since even “90% likely” good outcomes leave substantial risk.
  11. The author proposes cooperation where animal advocates use campaigning skills to push labs toward stronger alignment, while researchers incorporate animal welfare into alignment strategies and benchmarks.
  12. The author points to Anthropic adding “Welfare of animals and of all sentient beings” to Claude’s constitution and reports preliminary evidence of improved AnimalHarmBench performance as an example of tractable impact.


This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Executive summary: The author argues that claims about “dozens, maybe a hundred” cloud labs and their current biorisk are overstated, as only a handful of limited, immature services exist and they are not a major present risk compared to other biosecurity concerns.

Key points:

  1. The author claims Rob Reid overestimates both the number of cloud labs and the magnitude of their current risk.
  2. Cloud labs are defined as highly automated biological laboratories that can be remotely operated via software, in theory lowering barriers and improving reproducibility.
  3. The author states that only a handful of commercial cloud labs currently exist, mainly Emerald Cloud Lab, Strateos, and Ginkgo Bioworks.
  4. The author argues that cloud labs are not easily accessible or turnkey, requiring significant setup, specialized software, and ongoing consultation, making them unsuitable for many workflows.
  5. The author notes that current usage is limited, with high costs (e.g. ECL reportedly above $250k/year) and small customer bases.
  6. The author claims that examples like OpenAI–Ginkgo reflect high-throughput niches and still require substantial human involvement.
  7. The author argues that decentralized automation tools (e.g. liquid handlers) still require biological expertise and face hardware constraints.
  8. The author describes the main risk concern as lowering barriers to creating pathogens but argues this is overstated given current limitations and provider oversight.
  9. The author claims cloud labs are not a “black box” and involve scrutiny of user goals and protocols, including interaction with providers.
  10. The author argues that for many dual-use workflows (e.g. reverse genetics), cloud labs are a poor fit and contract research organizations may pose greater risk.
  11. The author believes cloud labs may pose some risk in generating data for pathogen optimization but are not a top current biosecurity concern.
  12. The author recommends safeguards such as screening protocols and materials, know-your-customer checks, and broader regulatory standards.


This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.
