
This is the latest in a series of essays on AI Scaling. 
You can find the others on my site.

Summary: RL-training for LLMs scales surprisingly poorly. Most of its gains come from allowing LLMs to productively use longer chains of thought, letting them think longer about a problem. There is some improvement for a fixed length of answer, but not enough to drive AI progress. Given that the scaling up of pre-training compute has also stalled, we'll see less AI progress via compute scaling than you might have thought, and more of it will come from inference scaling (which has different effects on the world). That lengthens timelines and affects strategies for AI governance and safety.

 

The current era of improving AI capabilities using reinforcement learning (from verifiable rewards) involves two key types of scaling:

  1. Scaling the amount of compute used for RL during training
  2. Scaling the amount of compute used for inference during deployment

We can see (1) as training the AI in more effective reasoning techniques and (2) as allowing the model to think for longer. I’ll call the first RL-scaling, and the second inference-scaling. Both new kinds of scaling were present all the way back in OpenAI’s announcement of their first reasoning model, o1, when they showed this famous chart:

I’ve previously shown that in the initial move from a base-model to a reasoning model, most of the performance gain came from unlocking the inference-scaling. The RL training did provide a notable boost to performance, even holding the number of tokens in the chain of thought fixed. You can see this RL boost in the chart below as the small blue arrow on the left that takes the base model up to the trend-line for the reasoning model. But this RL also unlocked the ability to productively use much longer chains of thought (~30x longer in this example). And these longer chains of thought contributed a much larger boost.

The question of where these capability gains come from is important because scaling up the inference compute has very different implications than scaling up the training compute. In this first round of reasoning models, they were trained with a very small amount of RL compute compared to the compute used in pre-training, meaning that the total cost of training was something like 1.01x higher than the base-model. But if most of the headline performance results require 30x as much inference compute, then the costs of deploying those capabilities are 30x higher. Since frontier AI developers are already spending more money deploying their models than they did training them, multiplying those costs by 30x is a big deal. Moreover, these are costs that have to be paid every time you want to use the model at this level of capability, so they can't be made up in volume.

But that was just the initial application of RL to LLMs. What happens as companies create more advanced reasoning models, using more RL?

The seeds of the answer can be found all the way back in that original o1 chart.

The chart shows steady improvements for both RL-scaling and inference-scaling, but they are not the same. Both graphs have the same y-axis and (despite the numbers being removed from the x-axis) we can see that they are both on a logarithmic x-axis covering almost exactly two orders of magnitude of scaling (100x). In both cases, the datapoints lie on a relatively straight line, which is presumably the central part of a larger S-curve. However, the slope of the RL-scaling graph (on the left) is almost exactly half that of the slope of the inference-scaling graph (on the right). When the x-axis is logarithmic, this has dramatic consequences.

The graph on the right shows that scaling inference-compute by 100x is enough to drive performance from roughly 20% to 80% on the AIME benchmark. This is pretty typical for inference scaling, where quite a variety of different models and benchmarks see performance improve from 20% to 80% when inference is scaled by 100x. 

For instance, this is what was found with Anthropic’s first reasoning model (Sonnet 3.7) on another AIME benchmark, with almost exactly the same scaling behaviour:

And ability on the ARC-AGI 1 benchmark also scales in a similar way for many of OpenAI’s different reasoning models:

We don’t always see this scaling behaviour for inference: some combinations of LLM, inference-scaling technique, and benchmark see the performance plateau below 80% or exhibit a different slope (often worse). But this climb from 20 to 80 with 100x more inference compute is pretty common (especially for reasoning-intensive benchmarks) and almost certainly what is happening on that original o1 graph.

In contrast, the slope of the RL-scaling trend is half as large, which means that it requires twice as many orders of magnitude to achieve the exact same improvement in capabilities. Increasing the RL training compute by 100x as shown in the o1 chart only improved performance from about 33% to 66%. At that rate, going from 20 to 80 would require scaling up the RL training compute by 10,000x.
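To make the slope arithmetic concrete, here is a minimal sketch (my own illustration, treating accuracy as roughly linear in log10(compute) over the observed range):

```python
# Illustrative arithmetic only: over the central part of the S-curve, accuracy
# rises roughly linearly in log10(compute).

def ooms_needed(start_acc, end_acc, slope_per_oom):
    """Orders of magnitude of extra compute needed to move accuracy at a given slope."""
    return (end_acc - start_acc) / slope_per_oom

inference_slope = (0.80 - 0.20) / 2   # ~30 points per OOM: 20% -> 80% over 100x
rl_slope = inference_slope / 2        # the o1 chart's RL slope is about half of that

print(ooms_needed(0.20, 0.80, inference_slope))  # 2.0 OOMs, i.e. 100x inference compute
print(ooms_needed(0.20, 0.80, rl_slope))         # 4.0 OOMs, i.e. 10,000x RL compute
```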

We can confirm this trend — and that it continued beyond o1 — by looking at the following graph from the o3 launch video (with a line added showing the slope corresponding to going from 20 to 80 in 10,000x):

Using another version of the AIME benchmark, this shows o1’s training progress over 3 orders of magnitude and o3’s training over a further order of magnitude. In total, we see that scaling up the RL-training by 4 orders of magnitude takes the model from about 26% to 88%. This provides some confirmation for the rule-of-thumb that a 10,000x scale-up in RL training compute is required to improve this benchmark performance from 20 to 80.

To my knowledge, OpenAI hasn’t provided RL-training curves for other benchmarks, but they do have charts comparing o1 with o3 and o3 with GPT-5 at different inference-scaling levels on several benchmarks. Given that o3 used about 10x as much RL training as o1, we’d expect the RL boost going from o1 to o3 to be worth about the same as the inference boost of giving o1 just half an order of magnitude more inference (~3x as many tokens). And this is indeed what one sees on their performance/token graph comparing the two:

Similarly, o3 also requires about 3x as many tokens to match GPT-5 on the SWE-bench and GPQA Diamond benchmarks. This would fit the expected pattern of GPT-5 having been trained with a further 10x as much RL training compute as o3:

It is hard to verify that this trend holds for models from other companies, as this data on training curves for cutting-edge models is often treated as confidential. But the fact that other leading labs’ base models and reasoning models are roughly on par with OpenAI’s suggests none of them are scaling notably better than this.

So the evidence on RL-scaling and inference-scaling supports a general pattern:

  • a 10x scaling of RL is required to get the same performance boost as a 3x scaling of inference
  • a 10,000x scaling of RL is required to get the same performance boost as a 100x scaling of inference

In general, getting the same benefit from RL-scaling as from inference-scaling requires twice as many orders of magnitude of compute. That's not good.
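As a rough sketch (my own illustrative helper, not anything from the labs), the rule of thumb can be written as: an N-fold RL scale-up buys roughly what a sqrt(N)-fold inference scale-up buys.

```python
import math

def equivalent_inference_multiplier(rl_multiplier):
    """Under the rule of thumb that RL-scaling needs twice as many orders of
    magnitude as inference-scaling, an N-fold RL scale-up is worth roughly a
    sqrt(N)-fold inference scale-up."""
    return 10 ** (math.log10(rl_multiplier) / 2)

print(equivalent_inference_multiplier(10))      # ~3.2x inference (the first bullet)
print(equivalent_inference_multiplier(10_000))  # 100x inference (the second bullet)
```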

How do these compare to pre-training scaling?

The jumps from GPT-1 to 2 to 3 to 4 each involved scaling up the pre-training compute by about 100x. How much of the RL-scaling or inference-scaling would be required to give a similar boost? While I can’t say for sure, we can put together the clues we have and take an educated guess.

Jones (2021) and Epoch AI both estimate that you need to scale up inference by roughly 1,000x to reach the same capability you'd get from a 100x scale-up of training. And since the evidence from o1 and o3 suggests we need about twice as many orders of magnitude of RL-scaling compared with inference-scaling, this implies we need something like a 1,000,000x scale-up of total RL compute to give a boost similar to a GPT level.
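Chaining these rough estimates together (illustrative arithmetic only):

```python
# Chaining the estimates in the text:
#   1) ~1,000x inference compute buys roughly what ~100x pre-training compute buys
#      (Jones 2021; Epoch AI), i.e. 3 OOMs of inference per 2 OOMs of pre-training.
#   2) RL-scaling needs ~2x as many OOMs as inference-scaling for the same boost.

pretrain_ooms_per_gpt_level = 2                        # one GPT level is ~100x pre-training
inference_ooms = pretrain_ooms_per_gpt_level * 3 / 2   # ~3 OOMs, i.e. ~1,000x inference
rl_ooms = inference_ooms * 2                           # ~6 OOMs of RL compute

print(f"{10 ** rl_ooms:,.0f}x")  # 1,000,000x RL compute for roughly one GPT-level jump
```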

This is breathtakingly inefficient scaling. But it fits with the extreme information inefficiency of RL training, which (compared to next-token-prediction) receives less than a ten-thousandth as much information to learn from per FLOP of training compute.
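The information inefficiency can be illustrated with a back-of-the-envelope calculation. The specific numbers below are assumptions of mine rather than figures from the essay, and the FLOP spent per token is treated as broadly comparable in the two regimes:

```python
# Rough learning signal per token processed (assumed round numbers):
bits_per_token_pretraining = 1.0   # next-token prediction: a signal on every token
tokens_per_rl_episode = 10_000     # one long reasoning trace (assumed length)
bits_per_rl_episode = 1.0          # RL from verifiable rewards: one pass/fail bit per episode

rl_bits_per_token = bits_per_rl_episode / tokens_per_rl_episode
print(bits_per_token_pretraining / rl_bits_per_token)  # ~10,000x less signal per token for RL
```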

Yet despite the poor scaling behaviour, RL training has so far been a good deal. This is solely because the scaling of RL compute began from such a small base compared with the massive amount of pre-training compute invested in today's models. While AI labs are reluctant to share information about how much compute has actually been spent on RL (witness the removal of all numbers from the twin o1 scaling graphs), it is widely believed that even the 10,000x RL-scaling we saw for o3's training still ended up using much less compute than was spent on pre-training. This means that OpenAI (and their competitors) have effectively got those early gains from RL-training for free.

For example, if the 10x scaling of RL compute from o1 to o3 took them from a total of 1.01x the pre-training compute to 1.1x, then the 10x scale-up came at the price of a 1.1x scale-up in overall training costs. If that gives the same performance boost as using 3x as many reasoning tokens (which would multiply all deployment costs of reasoning models by 3) then it is a great deal for a company that deploys its model so widely.
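The cost accounting in that example, with pre-training normalised to 1.0 and the RL fractions as hypothetical round numbers:

```python
pretrain = 1.00
rl_before = 0.01           # o1-era RL at ~1% of pre-training compute (assumed)
rl_after = rl_before * 10  # o3-era RL: a 10x scale-up of the RL stage

total_before = pretrain + rl_before   # 1.01x the pre-training compute
total_after = pretrain + rl_after     # 1.10x the pre-training compute
print(total_after / total_before)     # ~1.09x increase in total training cost,
                                      # i.e. roughly the 1.1x scale-up described above
```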

But this changes dramatically once RL-training reaches and then exceeds the size of the pre-training compute. In July 2025, xAI’s Grok 4 launch video included a chart suggesting that they had reached this level (where pre-training compute is shown in white and RL-training compute in orange):

Scaling RL by another 10x beyond this point increases the total training compute by 5.5x, and beyond that it is basically the full 10x increase to all training costs. So this is the point where the fact that they get much less for a 10x scale-up of RL compute compared with 10x scale-ups in pre-training or inference really bites. I estimate that at the time of writing (Oct 2025), we’ve already seen something like a 1,000,000x scale-up in RL training and it required ≤2x the total training cost. But the next 1,000,000x scale-up would require 1,000,000x the total training cost, which is not possible in the foreseeable future.
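A sketch of why the economics change at parity (pre-training and RL compute both normalised to 1.0):

```python
pretrain, rl = 1.0, 1.0                     # RL compute has caught up with pre-training
total_at_parity = pretrain + rl             # 2.0

total_after_10x = pretrain + rl * 10        # 11.0
print(total_after_10x / total_at_parity)    # 5.5x more total training compute

total_after_100x = pretrain + rl * 100      # 101.0
print(total_after_100x / total_after_10x)   # ~9.2x: close to the full 10x from here on
```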

Grok 4 was trained on 200,000 GPUs located in xAI's vast Colossus datacenter. Achieving the equivalent of a GPT-level jump through RL would (according to the rough scaling relationships above) require 1,000,000x the total training compute. To put that in perspective, it would require replacing every GPU in their datacenter with 5 entirely new datacenters of the same size, then using 5 years' worth of the entire world's electricity production to train the model. So it looks infeasible for further scaling of RL-training compute to give even a single GPT-level boost.
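The datacenter framing is just the 1,000,000x figure divided through by the current cluster size (illustrative only):

```python
gpus_now = 200_000              # Grok 4's training cluster
compute_multiplier = 1_000_000  # ~a GPT-level jump via RL, per the estimate above

datacenters_needed = compute_multiplier   # 1,000,000 datacenters of the same size,
                                          # assuming the same hardware per datacenter
print(datacenters_needed / gpus_now)      # 5.0 new datacenters per existing GPU
```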

I don’t think OpenAI, Google, or Anthropic have quite reached the point where RL training compute matches the pre-training compute. But they are probably not far off. So while we may see another jump in reasoning ability beyond GPT-5 by scaling RL training a further 10x, I think that is the end of the line for cheap RL-scaling.

Conclusion

The shift towards RL allowed the scaling era to continue even after pre-training scaling had stalled. It did so via two different mechanisms: scaling up the RL training compute and scaling up the inference compute. 

Scaling RL training allowed the model to learn for itself how to achieve better performance. Unlike the imitation learning of next-token-prediction, RL training has a track record of allowing systems to burst through the human level — finding new ways of solving problems that go beyond its training data. But in the context of LLMs, it scales poorly. We’ve seen impressive gains, but these were only viable when starting from such a low base. We have reached the point where it is too expensive to go much further. 

This leaves us with inference-scaling as the remaining form of compute-scaling. RL helped enable inference-scaling via longer chains of thought and, when it comes to LLMs, that may be its most important legacy. But inference-scaling has very different dynamics to scaling up the training compute. For one thing, it scales up the flow of ongoing costs instead of scaling the one-off training cost. This has many consequences for AI deployment, AI risk, and AI governance.

But perhaps more importantly, inference-scaling is really a way of improving capabilities by allowing the model more time to solve the problem, rather than by increasing its intelligence. Now that RL-training is nearing its effective limit, we may have lost the ability to effectively turn more compute into more intelligence.

Comments

While I appreciate this work being done, it seems a very bad sign for our world/timeline that the very few people with both philosophy training and an interest in AI x-safety are using their time/talent to do forecasting (or other) work instead of solving philosophical problems in AI x-safety, with Daniel Kokotajlo being another prominent example.

This implies one of two things: Either they are miscalculating the best way to spend their time, which indicates bad reasoning or intuitions even among humanity's top philosophers (i.e., those who have at least realized the importance of AI x-risk and are trying to do something about it). Or they actually are the best people (in a comparative advantage sense) available to work on these other problems, in which case the world must be on fire, and they're having to delay working on extremely urgent problems that they were trained for, to put out even bigger fires.

(Cross-posted to LW and EAF.)

Why do you think this work has less value than solving philosophical problems in AI safety? If LLM scaling is sputtering out, isn't that important to know? In fact, isn't it a strong contender for the most important fact about AI that could be known right now? 

I suppose you could ask why this work hasn't been done by somebody else already and that's a really good question. For instance, why didn't anyone doing equity research or AI journalism notice this already? 

Among people who are highly concerned about near-term AGI, I don't really expect such insights to be surfaced. There is strong confirmation bias. People tend to look for confirmation that AGI is coming soon and not for evidence against. So, I'm not surprised that Toby Ord is the first person within effective altruism to notice this. Most people aren't looking. But this doesn't explain why equity research analysts, AI journalists, or others who are interested in LLM scaling (such as AI researchers or engineers not working for one of the major LLM companies and not bound by an NDA) missed this. I am surprised an academic philosopher is the first person to notice this! And kudos to him for that!

Why do you think this work has less value than solving philosophical problems in AI safety?

From the perspective of comparative advantage and counterfactual impact, this work does not seem to require philosophical training. It seems to be straightforward empirical research that many people could do, besides the very few professionally trained AI-risk-concerned philosophers that humanity has.

To put it another way, I'm not sure that Toby was wrong to work on this, but if he was, it's because if he hadn't, then someone else with more comparative advantage for working on this problem (due to lacking training or talent for philosophy) would have done so shortly afterwards.

I'm not sure that Toby was wrong to work on this, but if he was, it's because if he hadn't, then someone else with more comparative advantage for working on this problem (due to lacking training or talent for philosophy) would have done so shortly afterwards.

How shortly? We're discussing this in October 2025. What's the newest piece of data that Toby's analysis is dependent on? Maybe the Grok 4 chart from July 2025? Or possibly qualitative impressions from the GPT-5 launch in August 2025? Who else is doing high-quality analysis of this kind and publishing it, even using older data? 

I guess I don't automatically buy the idea that even in a few months we'll see someone else independently go through the same reasoning steps as this post and independently come to the same conclusion. But there are plenty of people who could, in theory, do it and who are, in theory, motivated to do this kind of analysis and who also will probably not see this post (e.g. equity research analysts, journalists covering AI, AI researchers and engineers independent of LLM companies). 

I certainly don't buy the idea that if Toby hadn't done this analysis, then someone else in effective altruism would have done it. I don't see anybody else in effective altruism doing similar analysis. (I chalk that up largely to confirmation bias.)

I appreciate you raising this Wei (and Yarrow's responses too). They both echoed a lot of my internal debate on this. I'm definitely not sure whether this is the best use of my time. At the moment, my research time is roughly evenly split between this thread of essays on AI scaling and more philosophical work connected to longtermism, existential risk and post-AGI governance. The former is much easier to demonstrate forward progress and there is more of a demand signal for it. The latter is harder to be sure it is on the right path and is in less demand. My suspicion is that it is generally more important though, and that demand/appreciation doesn't track importance very well.

It is puzzling to me too that no-one else was doing this kind of work on understanding scaling. I think I must be adding some rare ingredient, but I can't think of anything rare enough to really explain why no-one else got these results first. (People at the labs probably worked out a large fraction of this, but I still don't understand why the people not at the labs didn't.)

In addition to the general questions about which strand is more important, there are a few more considerations:

  • No-one can tell ex ante how a piece of work or research stream will pan out, so everyone will always be wrong ex post sometimes in their prioritisation decisions
  • My day job is at Oxford University's AI Governance Initiative (a great place!) and I need to be producing some legible research that an appreciable number of other people are finding useful
  • I'm vastly more effective at work when I have an angle of attack and a drive to write up the results — recently this has been for these bite-size pieces of understanding AI scaling. The fact that there is a lot of response from others is helping with this as each piece receives some pushback that leads me to the next piece.

But I've often found your (Wei Dai's) comments over the last 15-or-so years to be interesting, unusual, and insightful. So I'll definitely take into account your expressed demand for more philosophical work and will look through those pages of philosophical questions you linked to.

Do you have any insights into why there are so few philosophers working in AI alignment, or closely with alignment researchers? (Amanda Askell is the only one I know.) Do you think this is actually a reasonable state of affairs (i.e., it's right or fine that almost no professional philosophers work directly as or with alignment researchers), or is this wrong/suboptimal, caused by some kind of cultural or structural problem? It's been 6 years since I wrote Problems in AI Alignment that philosophers could potentially contribute to and I've gotten a few comments from philosophers saying they found the list helpful or that they'll think about working on some of the problems, but I'm not aware of any concrete follow-ups.

If it is some kind of cultural or structural problem, it might be even higher leverage to work on solving that, instead of object level philosophical problems. I'd try to do this myself, but as an outsider to academic philosophy and also very far from any organizations who might potentially hire philosophers to work on AI alignment, it's hard for me to even observe what the problem might be.

Re 99% of academic philosophers, they are doing their own thing and have not heard of these possibilities and wouldn't be likely to move away from their existing areas if they had. Getting someone to change their life's work is not easy and usually requires hours of engagement to have a chance. It is especially hard to change what people work on in a field when you are outside that field.

A different question is about the much smaller number of philosophers who engage with EA and/or AI safety (there are maybe 50 of these). Some of these are working on some of those topics you mention. e.g. Will MacAskill and Joe Carlsmith have worked on several of these. I think some have given up philosophy to work on other things such as AI alignment. I've done occasional bits of work related to a few of these (e.g. here on dealing with infinities arising in decision theory and ethics without discounting) and also to other key philosophical questions that aren't on your list.

For such philosophers, I think it is a mixture of not having seen your list and not being convinced these are the best things that they each could be working on.

I was going to say something about lack of incentives, but I think it is also a lack of credible signals that the work is important, is deeply desired by others working in these fields, and would be used to inform deployments of AI. In my view, there isn't much desire for work like this from people in the field and they probably wouldn't use it to inform deployment unless a lot of effort is also added from the author to meet the right people, convince them to spend the time to take it seriously, etc.

Right, I know about Will MacAskill, Joe Carlsmith, and your work in this area, but none of you are working on alignment per se full time or even close to full time AFAIK, and the total effort is clearly far from adequate to the task at hand.

I think some have given up philosophy to work on other things such as AI alignment.

Any other names you can cite?

In my view, there isn't much desire for work like this from people in the field and they probably wouldn't use it to inform deployment unless a lot of effort is also added from the author to meet the right people, convince them to spend the time to take it seriously etc.

Thanks, this makes sense to me, and my follow-up is how concerning do you think this situation is?

One perspective I have is that at this point, several years into a potential AI takeoff, with AI companies now worth trillions in aggregate, alignment teams at AI companies still have virtually no professional philosophical oversight (or outside consultants that they rely on), and are kind of winging it based on their own philosophical beliefs/knowledge. It seems rather like trying to build a particle collider or fusion reactor with no physicists on the staff, only engineers.

(Or worse, unlike engineers' physics knowledge, I doubt that receiving a systematic education in fields like ethics and metaethics is a hard requirement for working as an alignment researcher. And even worse, unlike the situation in physics, we don't even have settled ethics/metaethics/metaphilosophy/etc. that alignment researchers can just learn and apply.)

Maybe the AI companies are reluctant to get professional philosophers involved, because in the fields that do have "professional philosophical oversight", e.g., bioethics, things haven't worked out that well. (E.g. human challenge trials being banned during COVID.) But to me, this would be a signal to yell loudly that our civilization is far from ready to attempt or undergo an AI transition, rather than a license to wing it based on one's own philosophical beliefs/knowledge.

As an outsider, the situation seems crazy alarming to me, and I'm confused that nobody else is talking about it, including philosophers like you who are in the same overall space and looking at roughly the same things. I wonder if you have a perspective that makes the situation not quite as alarming as it appears to me.

What happens if this is true and AI improvements will primarily be inference driven? 

It seems like this would be very bad news for AI companies, because customers would have to pay for accurate AI results directly, on a per-run basis. Furthermore, they would have to pay exponential costs for a linear increase in accuracy.

As a crude example, would you expect a translation agency to pay four times as much for translations with half as many errors? In either case, you'd still need a human to come along and correct the errors. 

Toby wrote this post, which touches on many of the questions you asked. I think it's been significantly under-hyped!

Thanks. I'm also a bit surprised by the lack of reaction to this series given that:

  • compute scaling has been the biggest story of AI in the last few decades
  • it has dramatically changed
  • very few people are covering these changes
  • it is surprisingly easy to make major crisp contributions to our understanding of it just by analysing the few pieces of publicly available data
  • the changes have major consequences for AI companies, AI timelines, AI risk, and AI governance

For my part, I simply didn't know the series existed until seeing this post, since this is the only post in the series on EAF.  :)

I agree — it seems weird that people haven’t updated very much. 

However, I wrote a similarly-purposed (though much less rigorous) post entitled "How To Update if Pre-Training is Dead," and Vladimir Nesov wrote the following comment (before the GPT-5 release), which I would be curious to hear your thoughts on:


Frontier AI training compute is currently increasing about 12x every two years, from about 7e18 FLOP/s in 2022 (24K A100s, 0.3e15 BF16 FLOP/s per chip), to about 1e20 FLOP/s in 2024 (100K H100s, 1e15 BF16 FLOP/s per chip), to 1e21 FLOP/s in 2026 (Crusoe/Oracle/OpenAI Abilene system, 400K chips in GB200/GB300 NVL72 racks, 2.5e15 BF16 FLOP/s per chip). If this trend takes another step, we'll have 1.2e22 FLOP/s in 2028 (though it'll plausibly take a bit longer to get there, maybe 2.5e22 FLOP/s in 2030 instead), with 5 GW training systems.

So the change between GPT-4 and GPT-4.5 is a third of this path. And GPT-4.5 is very impressive compared to the actual original GPT-4 from Mar 2023, it's only by comparing it to more recent models that GPT-4.5 isn't very useful (in its non-reasoning form, and plausibly without much polish). Some of these more recent models were plausibly trained on 2023 compute (maybe 30K H100s, 3e19 FLOP/s, 4x more than the original GPT-4), or were more lightweight models (not compute optimal, and with fewer total params) trained on 2024 compute (about the same as GPT-4.5).

So what we can actually observe from GPT-4.5 is that increasing compute by 3x is not very impressive, but the whole road from 2022 to 2028-2030 is a 1700x-3500x increase in compute from original GPT-4 (or twice that if we are moving from BF16 to FP8), or 120x-250x from GPT-4.5 (if GPT-4.5 is already trained in FP8, which was hinted at in the release video). Judging the effect of 120x from the effect of 3x is not very convincing. And we haven't really seen what GPT-4.5 can do yet, because it's not a reasoning model.

The best large model inference hardware available until very recently (other than TPUs) is B200 NVL8, with 1.5 TB of HBM, which makes it practical to run long reasoning on models with 1-3T FP8 total params that fit in 1-4 nodes (with room for KV caches). But the new GB200 NVL72s that are only starting to get online in significant numbers very recently each have 13.7 TB of HBM, which means you can fit a 7T FP8 total param model in just one rack (scale-up world), and in principle 10-30T FP8 param models in 1-4 racks, an enormous change. The Rubin Ultra NVL576 racks of 2028 will each have 147 TB of HBM, another 10x jump.

If GPT-4.5 was pretrained for 3 months at 40% compute utilization on a 1e20 FLOP/s system of 2024 (100K H100s), it had about 3e26 BF16 FLOPs of pretraining, or alternatively 6e26 FP8 FLOPs. For a model with 1:8 sparsity (active:total params), it's compute optimal to maybe use 120 tokens/param (40 tokens/param from Llama-3-405B, 3x that from 1:8 sparsity). So a 5e26 FLOPs of pretraining will make about 830B active params compute optimal, which means 7T total params. The overhead for running this on B200s is significant, but in FP8 the model fits in a single GB200 NVL72 rack. Possibly the number of total params is even greater, but fitting in one rack for the first model of the GB200 NVL72 era makes sense.

So with GB200 NVL72s, it becomes practical to run (or train with RLVR) a compute optimal 1:8 sparse MoE model pretrained on 2024 compute (100K H100s) with long reasoning traces (in thinking mode). Possibly this is what they are calling "GPT-5".

Going in the opposite direction in raw compute, but with more recent algorithmic improvements, there's DeepSeek-R1-0528 (37B active params, a reasoning model) and Kimi K2 (30B active params, a non-reasoning model), both pretrained for about 3e24 FLOPs and 15T tokens, 100x-200x less than GPT-4.5, but with much more sparsity than GPT-4.5 could plausibly have. This gives the smaller models about 2x more in effective compute, but also they might be 2x overtrained compared to compute optimal (which might be 240 tokens/param, from taking 6x the dense value for 1:32 sparsity), so maybe the advantage of GPT-4.5 comes out to 70x-140x. I think this is a more useful point of comparison than the original GPT-4, as a way of estimating the impact of 5 GW training systems of 2028-2030 compared to 100K H100s of 2024.
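As a quick sanity check (my own, using the standard C ≈ 6·N·D approximation and taking the quoted 120 tokens/param and 1:8 sparsity as given), the compute-optimal arithmetic in the quote does come out to roughly 830B active and ~7T total params:

```python
import math

C = 5e26                  # pretraining FLOPs assumed in the quoted comment
tokens_per_param = 120    # the comment's guess for a 1:8-sparse MoE

# C = 6 * N * D with D = 120 * N  =>  N = sqrt(C / 720)
N_active = math.sqrt(C / (6 * tokens_per_param))
N_total = 8 * N_active    # 1:8 active:total sparsity

print(f"{N_active:.2e}")  # ~8.3e11 active params (~830B)
print(f"{N_total:.2e}")   # ~6.7e12 total params (~7T)
```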

I don't know what to make of that. Obviously Vladimir knows a lot about state of the art compute, but there are so many details there without them being drawn together into a coherent point that really disagrees with you or me on this. 

It does sound like he is making the argument that GPT 4.5 was actually fine and on trend. I don't really believe this, and don't think OpenAI believed it either (there are various leaks they were disappointed with it, they barely announced it, and then they shelved it almost immediately). 

I don't think the argument about original GPT-4 really works. It improved because of post-training, but did they also add that post-training on GPT-4.5? If so, then the 10x compute really does add little. If not, then why not? Why is OpenAI's revealed preference to not put much effort into enhancing their most expensive ever system if not because they didn't think it was that good? 

There is a similar story re reasoning models. It is true that in many ways the advanced reasoning versions of GPT-4o (e.g. o3) are superior to GPT-4.5, but why not make it a reasoning model too? If that's because it would use too much compute or be too slow for users due to latency, then these are big flaws with scaling up larger models.

Shouldn't we be able to point to some objective benchmark if GPT-4.5 was really off trend? It got 10x the SWE-Bench score of GPT-4. That seems like solid evidence that additional pretraining continued to produce the same magnitude of improvements as previous scaleups. If there were now even more efficient ways than that to improve capabilities, like RL post-training on smaller o-series models, why would you expect OpenAI not to focus their efforts there instead? RL was producing gains and hadn't been scaled as much as self-supervised pretraining, so it was obvious where to invest marginal dollars. GPT-5 is better and faster than 4.5. This doesn't mean pretraining suddenly stopped working or went off trend from scaling laws though. 

It's very difficult to do this with benchmarks, because as the models improve benchmarks come and go. Things that used to be so hard that it couldn't do better than chance quickly become saturated and we look for the next thing, then the one after that, and so on. For me, the fact that GPT-4 -> GPT-4.5 seemed to involve climbing about half of one benchmark was slower progress than I expected (and the leaks from OpenAI suggest they had similar views to me). When GPT-3.5 was replaced by GPT-4, people were losing their minds about it — both internally and on launch day. Entirely new benchmarks were needed to deal with what it could do. I didn't see any of that for GPT-4.5.

I agree with you that the evidence is subjective and disputable. But I don't think it is a case where the burden of proof is disproportionately on those saying it was a smaller jump than previously.

(Also, note that this doesn't have much to do with the actual scaling laws, which are a measure of how much prediction error of the next token goes down when you 10x the training compute. I don't have reason to think that has gone off trend. But I'm saying that the real-world gains from this (or the intuitive measure of intelligence) has diminished, compared to the previous few 10x jumps. This is definitely compatible. e.g. if the model only trained on wikipedia plus an unending supply of nursery rhymes, its prediction error would continue to drop as more training happened, but its real world capabilities wouldn't improve by continued 10x jumps in the number of nursery rhymes added in. I think the real world is like this where GPT-4-level systems are already trained on most books ever written and much of the recorded knowledge of the last 10,000 years of civilisation, and it makes sense that adding more Reddit comments wouldn't move the needle much.)

Yes, what you are scaling matters just as much as the fact that you are scaling. So now developers are scaling RL post training and pretraining using higher quality synthetic data pipelines. If the point is just that training on average internet text provides diminishing real world returns in many real-world use cases, then that seems defensible; that certainly doesn't seem to be the main recipe any company is using for pushing the frontier right now. But it seems like people often mistake this for something stronger like "all training is now facing insurmountable barriers to continued real world gains" or "scaling laws are slowing down across the board" or "it didn't produce significant gains on meaningful tasks so scaling is done." I mentioned SWE-Bench because that seems to suggest significant real world utility improvements rather than trivial prediction loss decrease. I also don't think it's clear that there is such an absolute separation here - to model the data you have to model the world in some sense. If you continue feeding multimodal LLM agents the right data in the right way, they continue improving on real world tasks. 

Agreed! The series has been valuable for my personal thinking around this (I quoted the post I linked above as late as yesterday.) Imo, more people should be paying attention to this.

The crux for me is I don't agree that compute scaling has dramatically changed, because I don't think pre-training scaling has gotten much worse returns.

I'll just add a comment that Ord did an 80k podcast on this topic here.

This is a really compelling post. This seems like the sort of post that could have a meaningful impact on the opinions of people in the finance/investment world who are thinking about AI. I would be curious to see how equity research analysts and so on would react to this post. 

This is a very strong conclusion and seems very consequential if true:

This leaves us with inference-scaling as the remaining form of compute-scaling.

I was curious to see if you had a similar analysis that supports the assertion that "the scaling up of pre-training compute also stalled". Let me know if I missed something important. For the convenience of other readers, here are some pertinent quotes from your previous posts. 

From "Inference Scaling Reshapes AI Governance" (February 12, 2025):

But recent reports from unnamed employees at the leading labs suggest that their attempts to scale up pre-training substantially beyond the size of GPT-4 have led to only modest gains which are insufficient to justify continuing such scaling and perhaps even insufficient to warrant public deployment of those models. A possible reason is that they are running out of high-quality training data. While the scaling laws might still be operating (given sufficient compute and data, the models would keep improving), the ability to harness them through rapid scaling of pre-training may not.

 

There is a lot of uncertainty about what is changing and what will come next.

One question is the rate at which pre-training will continue to scale. It may be that pre-training has topped out at a GPT-4 scale model, or it may continue increasing, but at a slower rate than before. Epoch AI suggests the compute used in LLM pre-training has been growing at about 5x per year from 2020 to 2024. It seems like that rate has now fallen, but it is not yet clear if it has gone to zero (with AI progress coming from things other than pre-training compute) or to some fraction of its previous rate.

 

This strongly suggests that even though there are still many more unused tokens on the indexed web (about 30x as many as are used in GPT-4 level pre-training), performance is being limited by lack of high-quality tokens. There have already been attempts to supplement the training data with synthetic data (data produced by an LLM), but if the issue is more about quality than raw quantity, then they need the best synthetic data they can get.

From "The Extreme Inefficiency of RL for Frontier Models" (September 19, 2025):

LLMs and next-token prediction pre-training were the most amazing boost to generality that the field of AI has ever seen, going a long way towards making AGI seem feasible. This self-supervised learning allowed it to imbibe not just knowledge about a single game, or even all board games, or even all games in general, but every single topic that humans have ever written about — from ancient Greek philosophy to particle physics to every facet of pop culture. While their skills in each domain have real limits, the breadth had never been seen before. However, because they are learning so heavily from human generated data they find it easier to climb towards the human range of abilities than to proceed beyond them. LLMs can surpass humans at certain tasks, but we’d typically expect at least a slow-down in the learning curve as they reach the top of the human-range and can no longer copy our best techniques — like a country shifting from fast catch-up growth to slower frontier growth.

I broadly don't think inference scaling is the only path, primarily because I disagree with the claim that pre-training returns declined much, and attribute the GPT-4.5 evidence as mostly a case of broken compute promises making everything disappointing.

I also have a hypothesis that current RL is mostly serving as an elicitation method for pre-trained AIs.

We shall see in 2026-2027 whether this remains true.

I disagree with the claim that pre-training returns declined much

Could you elaborate or link to somewhere where someone makes this argument? I'm curious to see if a strong defense can be made of self-supervised pre-training of LLMs continuing to scale and deliver worthwhile, significant benefits.

I currently can't find a source, but to elaborate a little bit, my reason for thinking this is that the GPT-4 to GPT-4.5 scaleup used 15x the compute instead of 100x the compute. I remember that 10x compute is enough to be competitive with the current algorithmic improvements that don't involve scaling up models, whereas 100x compute increases result in the wow moments we associated with GPT-3 to GPT-4. And the GPT-5 release was not a scale-up of compute, but instead productionizing GPT-4.5.

I'm more in the camp of "I find little reason to believe that pre-training returns have declined" here.

  1. It seems more likely that RL does actually allow LLMs to learn new skills.
  2. RL + LLMs is still pretty new but we already have clear signs it exhibits scaling laws with the right setup just like self-supervised pretraining. This time they appear to be sigmoidal, probably based on something like each policy or goal or environment they're trained with. It has been about 1 year since o1-preview and maybe this was being worked on to some degree about a year before that.
  3. The Grok chart contains no numbers, which is so strange I don't think you can conclude much from it except "we used more RL than last time." It also seems likely that they might not yet be as efficient as OpenAI and DeepMind, who have been in the RL game for much longer with projects like AlphaZero and AlphaStar. 

The Grok chart contains no numbers, which is so strange I don't think you can conclude much from it except "we used more RL than last time."

Isn't the point just that the amount of compute used for RL training is now roughly the same as the amount of compute used for self-supervised pre-training? Because if this is true, then scaling up RL training compute another 1,000,000x is obviously not feasible.

My main takeaway from this post is not whether RL training would continue to provide benefits if it were scaled up another 1,000,000x, just that the world doesn't have nearly enough GPUs, electricity, or investment capital for that to be possible.

Maybe or maybe not - people also thought we would run out of training data years ago. But that has been pushed back and maybe won't really matter given improvements in synthetic data, multimodal learning, and algorithmic efficiency. 

What part do you think is uncertain? Do you think RL training could become orders of magnitude more compute efficient? 

Thank you, Toby et al., for this characteristically clear and compelling analysis and discussion. The argument that RL scaling is breathtakingly inefficient and may be hitting a hard limit is a crucial consideration for timelines.

This post made me think about the nature of this bottleneck, and I'm curious to get the forum's thoughts on a high-level analogy. I'm not an ML researcher, so I’m offering this with low confidence, but it seems to me there are at least two different "types" of hard problems.

1. A Science Bottleneck (Fusion Power): Here, the barrier appears to be fundamental physics. We need to contain a plasma that is inherently unstable at temperatures hotter than the sun. Despite decades of massive investment and brilliant minds, we can't easily change the underlying laws of physics that make this so difficult. Progress is slow, and incentives alone can't force a breakthrough.

2. An Engineering Bottleneck (Manhattan Project): Here, the core scientific principle was known (nuclear fission). The barrier was a set of unprecedented engineering challenges: how to enrich enough uranium, how to build a stable reactor, etc. The solution, driven by immense incentives, was a brute-force, parallel search for any viable engineering path (e.g., pursuing gaseous diffusion, electromagnetic separation, and plutonium production all at once).

This brings me back to the RL scaling issue. I'm wondering which category this bottleneck falls into.

From the outside, it feels more like an engineering or "Manhattan Project" problem. The core scientific discovery (the Transformer architecture, the general scaling paradigm) seems to be in place. The bottleneck Ord identifies is that one specific method (RL - likely PPO based) is significantly compute-inefficient and hard to continue scaling.

But the massive commercial incentives at frontier labs aren't just to make this one inefficient method 1,000 or 1,000,000x bigger. The incentive is to invent new, more efficient methods to achieve the same goal or similar.

We've already seen a small-scale example of this with the rapid shift from complex RLHF to the more efficient Direct Preference Optimization (DPO). This suggests the problem may not be a fundamental "we can't continue to improve models" barrier, but an engineering one: "this way of improving models is too expensive and unstable."

If this analogy holds, it seems plausible that the proprietary work at the frontier isn't just grinding on the inefficient RL problem, but is in a "Manhattan"-style race to find a new algorithm or architecture that bypasses this specific bottleneck.

This perspective makes me less confident that this particular bottleneck will be the one that indefinitely pushes out timelines, as it seems like exactly the kind of challenge that massive, concentrated incentives are historically good at solving.

I could be completely mischaracterizing the nature of the challenge, though, and still feel quite uncertain. I'd be very interested to hear from those with more technical expertise if this framing seems at all relevant or if the RL bottleneck is, in fact, closer to a fundamental science or "Fusion" problem.


 

I understand where you are coming from, and you are certainly not alone in trying to think about AI progress in terms of analogies like this. But I want to explain why I don't think such discussions — which are common — will take us down a productive route. 

I don't think anyone can predict the future based on this kind of reasoning. We can guess and speculate, but that's it. It's always possible, in principle, that, at any time, new knowledge could be discovered that would have a profound impact on technology. How likely is that in any particular case? Very hard to say. Nobody really knows.

There are many sorts of technologies that at least some experts think should, in principle, be possible and for which, at least in theory, there is a very large incentive to create, yet still haven't been created. Fusion is a great example, but I don't think we can draw a clean distinction between "science" on the one hand and "engineering" on the other and say science is much harder and engineering is much easier, such that if we want to find out how hard it will be to make fundamental improvements in reinforcement learning, we just have to figure out whether it's a "science" problem or an "engineering" problem. There's a vast variety of science problems and engineering problems. Some science problems are much easier than some engineering problems. For example, discovering a new exoplanet and figuring out its properties, a science problem, is much easier than building a human settlement on Mars, an engineering problem.

What I want to push back against in EA discourse and AGI discourse more generally is unrigorous thinking. In the harsh but funny words of the physicist David Deutsch (emphasis added):

The search for hard-to-vary explanations is the origin of all progress. It's the basic regulating principle of the Enlightenment. So, in science, two false approaches blight progress. One's well-known: untestable theories. But the more important one is explanationless theories. Whenever you're told that some existing statistical trend will continue but you aren't given a hard-to-vary account of what causes that trend, you're being told a wizard did it.

If LLMs' progress on certain tasks has improved over the last 7 years for certain specific reasons, and now we have good reasons to think those reasons for improvement won't be there much longer going forward, then of course you can say, "Well, maybe someone will come up with some other way to keep progress going!" Maybe they will, maybe they won't, who knows? We've transitioned from a rigorous argument based on evidence about LLM performance (mainly performance on benchmarks) and a causal theory of what accounts for that progress (mainly scaling of data and compute) to a hand-wavey idea about how maybe somebody will figure it out. Scaling you can track on a chart, but somebody figuring out a new idea like that is not something you can rigorously say will take 2 years or 20 years or 200 years.

The unrigorous move is:

There is a statistical trend that is caused by certain factors. At some point fairly soon, those factors will not be able to continue causing the statistical trend. But I want to keep extrapolating the statistical trend, so I'm going to speculate that new causal factors will appear that will keep the trend going. 

That doesn't make any sense! 

To be fair, people do try to reach for explanations of why new causal factors will appear, usually appealing to increasing inputs to AI innovation, such as the number of AI researchers, the number of papers published, improvements in computer hardware, and the amount of financial investment. But we currently have no way of predicting how exactly the inputs to science, technology, or engineering will translate into the generation of new ideas that keep progress going. We can say more is better, but we can't say X amount of dollars, Y amount of researchers, and Z amount of papers is enough to continue recent LLM progress. So, this is not a rigorous theory or model or explanation, but just a hand-wavey guess about what might happen. And to be clear, it might happen, but it might not, and we simply don't know! (And I don't think there's any rigour or much value in the technique of trying to squeeze Bayesian blood from a stone of uncertainty by asking people to guess numbers of how probable they think something is.)

So, it could take 2 years and it could take 20 years (or 200 years), and we don't know, and we probably can't find out any other way than just waiting and seeing. But how should we act, given that uncertainty?

Well, how would we have acted if LLMs had made no progress over the last 7 years? The same argument would have applied: anyone at any time could come up with the right ideas to make AI progress go forward. Making reinforcement learning orders of magnitude more efficient is something someone could have done 7 years ago. It more or less has nothing to do with LLMs. Absent the progress in LLMs would we have thought: "oh, surely, somebody's going to come up with a way to make RL vastly more efficient sometime soon"? Probably not, so we probably shouldn't think that now. If the reason for thinking that is just wanting to keep extrapolating the statistical trend, that's not a good reason. 

There is more investment in AI now in the capital markets, but, as I said above, that doesn't allow us to predict anything specific. Moreover, it seems like very little of the investment is going to fundamental AI research. It seems like almost all the money is going toward expenses much more directly relating to productizing LLMs, such as incremental R&D, the compute cost of training runs, and building datacentres (plus all the other expenses related to running a large tech company). 
