MattJ

Fundraiser and social profit advocate

2 karma · Joined · Working (6-15 years)

Comments (5)
Yarrow, thank you for this sharp and clarifying discussion.

You have completely convinced me that my earlier arguments from "investment as a signal" or "LHC/Pascal's Wager" were unrigorous, and I concede those points.

I think I can now articulate my one, non-speculative crux.

The "so what" of Toby Ord's (excellent) analysis is that it provides a perfect, rigorous, "hindsight" view of the last paradigm—what I've been calling "Phase 1" RL for alignment.

My core uncertainty isn't speculative "what-if" hope. It's that the empirical ground is shifting.

The very recent papers we discussed (Khatri et al. on the "art" of scaling, and Tan et al. on math reasoning) are, for me, the first public, rigorous evidence for a "Phase 2" capability paradigm.

• They provide a causal mechanism for why the old, simple scaling data may be an unreliable predictor.

• They show this "Phase 2" regime is different: it's not a simple power law but a complex, recipe-dependent "know-how" problem (Khatri), and it has different efficiency dynamics (Tan).

This, for me, is the action-relevant dilemma.

We are no longer in a state of "pure speculation". We are in a state of grounded, empirical uncertainty where the public research is just now documenting a new, more complex scaling regime that the private labs have been pursuing in secret.

Given that the lead time for any serious safety work is measured in years, and that any breakthrough would likely arrive as a proprietary, secret "recipe," the "wait for public proof" strategy seems non-robust.

That's the core of my concern. I'm now much clearer on the crux of the argument, and I can't thank you enough for pushing me to be more rigorous. This has been incredibly helpful, and I'll leave it there.


 

Yarrow, these are fantastic, sharp questions. Your “already accounted for” point is the strongest counter-argument I’ve encountered.

You’re correct in your interpretation of the terms. And your core challenge—if LLM reward models and verifiable domains have existed for ~3 years, shouldn’t their impact already be visible?—is exactly what I’m grappling with.

Let me try to articulate my hypothesis more precisely:

The Phase 1 vs Phase 2 distinction:

I wonder whether we're conflating two different uses of RL that might have very different efficiency profiles:

1. Phase 1 (Alignment/Style): This is the RLHF that created ChatGPT—steering a pretrained model to be helpful/harmless. This has been done for ~3 years and is probably what’s reflected in public benchmark data.

2. Phase 2 (Capability Gains): This is using RL to make models fundamentally more capable at tasks through extended reasoning or self-play (e.g., o1, AlphaGo-style approaches).

My uncertainty is: could “Phase 2” RL have very different efficiency characteristics than “Phase 1”?

Recent academic evidence:

Some very recent papers seem to directly address this question:

• A paper by Khatri et al., "The Art of Scaling Reinforcement Learning Compute for LLMs" (arXiv: 2510.13786), appears to show that simple RL methods do hit hard performance ceilings (validating your skepticism), but that scaling RL is a complex “art.” It suggests a specific recipe (ScaleRL) can achieve predictable scaling. This hints that the bottleneck might be “know-how” rather than a fundamental limit (see the toy sketch after this list).

• Another paper, Tan et al., "Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning" (arXiv: 2509.25300), finds that RL performance on math reasoning is bounded more by data quality (e.g., from verifiable domains) than by compute alone, and that larger models are more compute- and sample-efficient at these tasks.
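To make the difference between "a power law forever" and "predictable but saturating" concrete, here is a toy sketch. It is my own illustration with made-up numbers, not data or code from either paper; it fits a sigmoid-in-log-compute curve (one common choice for a saturating form) to hypothetical RL results:

```python
# Toy illustration only: synthetic numbers, not data from Khatri et al. or Tan et al.
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical RL compute budgets (arbitrary units) and benchmark pass rates.
compute = np.array([1e2, 3e2, 1e3, 3e3, 1e4, 3e4, 1e5])
pass_rate = np.array([0.22, 0.30, 0.41, 0.52, 0.60, 0.64, 0.66])

def saturating(c, ceiling, midpoint, slope):
    """Sigmoid in log-compute: performance approaches a ceiling instead of growing without bound."""
    return ceiling / (1.0 + np.exp(-slope * (np.log10(c) - midpoint)))

params, _ = curve_fit(saturating, compute, pass_rate, p0=[0.7, 3.0, 1.0])
ceiling, midpoint, slope = params
print(f"fitted ceiling ~ {ceiling:.2f}; gains flatten past roughly 10^{midpoint:.1f} units of compute")
```

On this framing, the action-relevant question is whether a better recipe raises the fitted ceiling or merely shifts where the curve bends, and public data alone may not distinguish the two.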

Why this seems relevant:

This research suggests “Phase 1” RL (simple, public methods) and “Phase 2” RL (complex recipes, high-quality data, large models) might have quite different scaling properties.

This makes me wonder if the scaling properties from prior RL research might not fully capture what’s possible in this new regime: very large models + high-quality verifiable domains + substantial compute + the right training recipe. Prior research isn’t irrelevant, but perhaps extrapolation from it is unreliable when the conditions are changing this much?

If labs have found (or are close to finding) these “secret recipes” for scalable RL, that could explain continued capital investment from well-informed actors despite public data showing plateaus.

The action-relevant dilemma:

Even granting the epistemic uncertainty, there seems to be a strategic question: Given long lead times for safety research, should researchers hedge by preparing for RL efficiency improvements, even if we can’t confidently predict them?

The asymmetry: if we wait for public evidence before starting safety work, and RL does become substantially more efficient (because a lab finds the right “recipe”), we’ll have even less lead time. But if we prepare unnecessarily, we’ve misallocated resources.

I don’t have a clean answer to what probability of a breakthrough would justify heightened precautionary work. But the epistemic uncertainty itself, combined with papers suggesting the scaling regime might be fundamentally different from what has been assumed, makes me worry that we’re evaluating the efficiency of propellers while jet engines are being invented in private.
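To make that threshold question a bit more concrete, the break-even condition is just an expected-cost comparison. The numbers below are illustrative placeholders, not actual estimates of safety-research costs:

```python
# Illustrative break-even check with made-up numbers, not real cost estimates.
def should_prepare(p_breakthrough: float, cost_unprepared: float, cost_preparation: float) -> bool:
    """Prepare when the expected loss from being caught unprepared exceeds the cost of preparing."""
    return p_breakthrough * cost_unprepared > cost_preparation

# Example: if being caught unprepared for efficient RL is ~20x as costly as the safety work itself,
# even a ~10% chance that the "recipe" exists crosses the threshold.
print(should_prepare(p_breakthrough=0.10, cost_unprepared=20.0, cost_preparation=1.0))  # True
```

Of course, both the probability and the cost ratio are exactly the quantities the information asymmetry makes hard to estimate, so I don't think this resolves the dilemma on its own.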

Does this change your analysis at all, or do you think the burden of proof still requires more than theoretical papers about potential scaling regimes?
 


 

YB, thank you for the pushback. You’ve absolutely convinced me that my “science vs. engineering” analogy was unrigorous, and your core point, that we shouldn’t extrapolate a trend by assuming a new causal factor will appear, is the correct null hypothesis to hold.


What I’m still trying to reconcile, specifically regarding RL efficiency improvements, is a tension between what we can observe and what may be hidden from view.


I expect Toby’s calculations are 100% correct. Your case is also rigorous and evidence-based: RL has been studied for decades, PPO (2017) was incremental, and we shouldn’t assume 10x-100x efficiency gains without evidence. The burden of proof is on those claiming breakthroughs are coming.
But RL research seems particularly subject to information asymmetry:

• Labs have strong incentives to keep RL improvements proprietary (competitive advantage in RLHF, o1-style reasoning, agent training).

• Negative results rarely get published (we don’t know what hasn’t worked).

• The gap between “internal experiments” and “public disclosure” may be especially long for RL.

We’ve seen this pattern before: AlphaGo’s multi-year information lag, GPT-4’s ~7-month gap. But for RL specifically, the opacity seems greater. OpenAI uses RL for o1, but we don’t know their techniques, efficiency gains, or scaling properties. DeepMind’s work on RL is similarly opaque.


This leaves me uncertain about future RL scaling specifically. On one hand, you’re right that decades of research suggest efficiency improvements are hard. On the other hand, recent factors (LLMs as reward models, verifiable domains for self-play, unprecedented compute for experiments) combined with information asymmetry make me wonder if we’re reasoning from incomplete data.


The specific question: does the combination of (a) new factors like LLM reward models and verifiable domains and (b) the opacity and volume of RL research at frontier labs warrant updating our priors on RL efficiency? Or is this still the same “hand-waving” trap: assuming hidden progress exists because we expect the trend to continue?


On the action-relevant side: if RL efficiency improvements would enable significantly more capable agents or self-improvement, should safety researchers prepare for that scenario despite epistemic uncertainty? The lead times for safety work seem long enough that “wait and see” may not be viable.

For falsifiability: we should know within 18-24 months. If RL-based systems (agents, reasoners) don’t show substantial capability gains despite continued investment, that would validate skepticism. If they do, it would suggest there were efficiency improvements we couldn’t see from outside.


I’m genuinely uncertain here and would value your sense of whether the information asymmetry around RL research specifically changes how we should weigh the available evidence.

Thank you, Toby et al., for this characteristically clear and compelling analysis and discussion. The argument that RL scaling is breathtakingly inefficient and may be hitting a hard limit is a crucial consideration for timelines.

This post made me think about the nature of this bottleneck, and I'm curious to get the forum's thoughts on a high-level analogy. I'm not an ML researcher, so I’m offering this with low confidence, but it seems to me there are at least two different "types" of hard problems.

1. A Science Bottleneck (Fusion Power): Here, the barrier appears to be fundamental physics. We need to contain a plasma that is inherently unstable at temperatures hotter than the sun. Despite decades of massive investment and brilliant minds, we can't easily change the underlying laws of physics that make this so difficult. Progress is slow, and incentives alone can't force a breakthrough.

2. An Engineering Bottleneck (Manhattan Project): Here, the core scientific principle was known (nuclear fission). The barrier was a set of unprecedented engineering challenges: how to enrich enough uranium, how to build a stable reactor, etc. The solution, driven by immense incentives, was a brute-force, parallel search for any viable engineering path (e.g., pursuing gaseous diffusion, electromagnetic separation, and plutonium production all at once).

This brings me back to the RL scaling issue. I'm wondering which category this bottleneck falls into.

From the outside, it feels more like an engineering or "Manhattan Project" problem. The core scientific discovery (the Transformer architecture, the general scaling paradigm) seems to be in place. The bottleneck Ord identifies is that one specific method (RL, likely PPO-based) is significantly compute-inefficient and hard to continue scaling.

But the massive commercial incentives at frontier labs aren't just to make this one inefficient method 1,000x or 1,000,000x bigger. The incentive is to invent new, more efficient methods that achieve the same or similar goals.

We've already seen a small-scale example of this with the rapid shift from complex RLHF to the more efficient Direct Preference Optimization (DPO). This suggests the problem may not be a fundamental "we can't continue to improve models" barrier, but an engineering one: "this way of improving models is too expensive and unstable."
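For readers less familiar with why DPO counts as the "more efficient" option here: it replaces the full RLHF pipeline (train a separate reward model, then run PPO against it) with a single supervised-style loss on preference pairs. Below is a minimal sketch of the published DPO objective in PyTorch, written as my own illustration rather than any lab's training code; the batch values and beta are made up.

```python
# Minimal sketch of the DPO objective (Rafailov et al., 2023); illustration only.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss over a batch of preference pairs.

    Inputs are summed log-probabilities of whole responses under the trainable
    policy and a frozen reference model. No reward model and no PPO rollout
    loop are needed: the preference data is used directly.
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage with made-up log-probabilities for three preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -15.2]), torch.tensor([-13.1, -11.0, -14.8]),
                torch.tensor([-12.5, -10.0, -15.0]), torch.tensor([-12.9, -10.8, -15.1]))
print(loss)  # scalar; lower when the policy prefers the chosen responses more than the reference does
```

The point of the sketch is just that the shift from PPO-based RLHF to DPO removed an entire stage of the pipeline, which is exactly the kind of "bypass the expensive part" engineering move I have in mind.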

If this analogy holds, it seems plausible that the proprietary work at the frontier isn't just grinding on the inefficient RL problem, but is in a "Manhattan"-style race to find a new algorithm or architecture that bypasses this specific bottleneck.

This perspective makes me less confident that this particular bottleneck will be the one that indefinitely pushes out timelines, as it seems like exactly the kind of challenge that massive, concentrated incentives are historically good at solving.

I could be completely mischaracterizing the nature of the challenge, though, and still feel quite uncertain. I'd be very interested to hear from those with more technical expertise if this framing seems at all relevant or if the RL bottleneck is, in fact, closer to a fundamental science or "Fusion" problem.


 

The “depopulation bad” framing, while helpful for engagement, misses key longtermist concerns in my opinion. The real question isn’t just how many people exist, but whether humanity (and other life) can flourish sustainably within planetary boundaries.

We’re already in ecological overshoot, degrading biosphere systems essential to all sentient life. Climate change is just one facet of a broader set of pressures on those systems. A smaller, well-supported population, achieved via voluntary, rights-based policies, could reduce existential risk by stabilizing Earth’s life-support systems, supporting biodiversity, and improving welfare per capita.

Yes, demographic decline poses economic and institutional challenges. But these are solvable. Civilizational collapse from ecological breakdown is not.

Optimizing for total population without sufficient ecological resilience puts long-term value at risk. We should aim for a population trajectory that preserves planetary habitability over the long run.

Thanks for the discussion!
