
Noah Birnbaum

Junior @ University of Chicago
591 karma · Pursuing an undergraduate degree

Bio


I am a rising junior at the University of Chicago (co-president of UChicago EA and founder of Rationality Group). I am mostly interested in philosophy (particularly metaethics, formal epistemology, and decision theory), economics, and entrepreneurship. 

I also have a Substack where I post about philosophy (ethics, epistemology, EA, and other stuff). Find it here: https://substack.com/@irrationalitycommunity?utm_source=user-menu. 

Reach out to me via email at dnbirnbaum@uchicago.edu.

How others can help me

If anyone has opportunities to do effective research in philosophy (or in applying philosophy to the real world or related fields), or any entrepreneurial opportunities, I would love to hear about them. Feel free to DM me!

How I can help others

I can help with philosophy stuff (maybe?) and organizing school clubs (maybe?)

Comments (61)

I agree — it seems weird that people haven’t updated very much. 

However, I wrote a similarly purposed (though much less rigorous) post entitled “How To Update if Pre-Training is Dead,” and Vladimir Nesov wrote the following comment (before the GPT-5 release), which I would be curious to hear your thoughts on:


Frontier AI training compute is currently increasing about 12x every two years, from about 7e18 FLOP/s in 2022 (24K A100s, 0.3e15 BF16 FLOP/s per chip), to about 1e20 FLOP/s in 2024 (100K H100s, 1e15 BF16 FLOP/s per chip), to 1e21 FLOP/s in 2026 (Crusoe/Oracle/OpenAI Abilene system, 400K chips in GB200/GB300 NVL72 racks, 2.5e15 BF16 FLOP/s per chip). If this trend takes another step, we'll have 1.2e22 FLOP/s in 2028 (though it'll plausibly take a bit longer to get there, maybe 2.5e22 FLOP/s in 2030 instead), with 5 GW training systems.

So the change between GPT-4 and GPT-4.5 is a third of this path. And GPT-4.5 is very impressive compared to the actual original GPT-4 from Mar 2023, it's only by comparing it to more recent models that GPT-4.5 isn't very useful (in its non-reasoning form, and plausibly without much polish). Some of these more recent models were plausibly trained on 2023 compute (maybe 30K H100s, 3e19 FLOP/s, 4x more than the original GPT-4), or were more lightweight models (not compute optimal, and with fewer total params) trained on 2024 compute (about the same as GPT-4.5).

So what we can actually observe from GPT-4.5 is that increasing compute by 3x is not very impressive, but the whole road from 2022 to 2028-2030 is a 1700x-3500x increase in compute from original GPT-4 (or twice that if we are moving from BF16 to FP8), or 120x-250x from GPT-4.5 (if GPT-4.5 is already trained in FP8, which was hinted at in the release video). Judging the effect of 120x from the effect of 3x is not very convincing. And we haven't really seen what GPT-4.5 can do yet, because it's not a reasoning model.
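
To make the arithmetic here concrete, here is a quick back-of-the-envelope sketch in Python; all FLOP/s figures and the ~12x-per-two-years trend are taken from the comment above, not independent estimates.

```python
# Back-of-the-envelope check of the compute trend and ratios quoted above.
# All FLOP/s figures are the comment's estimates, not independent data.

frontier_flops = {
    2022: 7e18,   # ~24K A100s, ~0.3e15 BF16 FLOP/s per chip
    2024: 1e20,   # ~100K H100s, ~1e15 BF16 FLOP/s per chip
    2026: 1e21,   # ~400K chips in GB200/GB300 NVL72 racks, ~2.5e15 BF16 FLOP/s per chip
}

# One more ~12x step lands at ~1.2e22 FLOP/s (2028), or ~2.5e22 if it slips to 2030.
projected = {2028: frontier_flops[2026] * 12, 2030: 2.5e22}

for year, flops in projected.items():
    print(f"{year}: ~{flops:.1e} FLOP/s")
    print(f"  vs 2022 (original GPT-4 era): {flops / frontier_flops[2022]:.0f}x")
    print(f"  vs 2024 (GPT-4.5 era):        {flops / frontier_flops[2024]:.0f}x")
```

The printed ratios reproduce the comment's 1700x-3500x (from the original GPT-4) and 120x-250x (from GPT-4.5) figures.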

The best large model inference hardware available until very recently (other than TPUs) is B200 NVL8, with 1.5 TB of HBM, which makes it practical to run long reasoning on models with 1-3T FP8 total params that fit in 1-4 nodes (with room for KV caches). But the new GB200 NVL72s that are only starting to get online in significant numbers very recently each have 13.7 TB of HBM, which means you can fit a 7T FP8 total param model in just one rack (scale-up world), and in principle 10-30T FP8 param models in 1-4 racks, an enormous change. The Rubin Ultra NVL576 racks of 2028 will each have 147 TB of HBM, another 10x jump.
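
Rough weights-only arithmetic behind those fits-in-a-rack numbers: at FP8, 1 TB of HBM holds about 1T params, and the smaller counts in the comment leave headroom for KV caches.

```python
# Weights-only HBM capacity at FP8 (1 byte per param, so 1 TB of HBM ~ 1T params);
# the smaller param counts quoted above leave headroom for KV caches.
hbm_per_unit_tb = {
    "B200 NVL8 node (1-4 nodes)": 1.5,
    "GB200 NVL72 rack (1-4 racks)": 13.7,
    "Rubin Ultra NVL576 rack (2028)": 147,
}
for system, tb in hbm_per_unit_tb.items():
    print(f"{system}: {tb} TB HBM per unit -> ~{tb:.0f}T FP8 params max per unit (weights only)")
```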

If GPT-4.5 was pretrained for 3 months at 40% compute utilization on a 1e20 FLOP/s system of 2024 (100K H100s), it had about 3e26 BF16 FLOPs of pretraining, or alternatively 6e26 FP8 FLOPs. For a model with 1:8 sparsity (active:total params), it's compute optimal to maybe use 120 tokens/param (40 tokens/param from Llama-3-405B, 3x that from 1:8 sparsity). So a 5e26 FLOPs of pretraining will make about 830B active params compute optimal, which means 7T total params. The overhead for running this on B200s is significant, but in FP8 the model fits in a single GB200 NVL72 rack. Possibly the number of total params is even greater, but fitting in one rack for the first model of the GB200 NVL72 era makes sense.
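
Checking that sizing with the standard C ≈ 6·N·D approximation; the 120 tokens/param figure for a 1:8-sparse MoE is the comment's assumption, not an established constant.

```python
# Rough check of the compute-optimal sizing above, using C ~ 6*N*D
# (C = pretraining FLOPs, N = active params, D = training tokens).

seconds = 3 * 30 * 24 * 3600      # ~3 months of training
utilization = 0.40
cluster_flops = 1e20              # ~100K H100s, BF16
C_bf16 = cluster_flops * utilization * seconds
print(f"pretraining compute: {C_bf16:.1e} FLOPs")   # ~3e26 BF16 (or ~6e26 counted in FP8)

tokens_per_param = 120            # comment's assumption: 40 (dense, Llama-3-405B) x 3 for 1:8 sparsity
C = 5e26                          # the figure used in the comment
# With D = tokens_per_param * N and C = 6*N*D:  N = sqrt(C / (6 * tokens_per_param))
N_active = (C / (6 * tokens_per_param)) ** 0.5
print(f"compute-optimal active params: {N_active:.2e}")      # ~8.3e11, i.e. ~830B
print(f"total params at 1:8 sparsity:  {8 * N_active:.2e}")   # ~6.7e12, i.e. ~7T
```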

So with GB200 NVL72s, it becomes practical to run (or train with RLVR) a compute optimal 1:8 sparse MoE model pretrained on 2024 compute (100K H100s) with long reasoning traces (in thinking mode). Possibly this is what they are calling "GPT-5".

Going in the opposite direction in raw compute, but with more recent algorithmic improvements, there's DeepSeek-R1-0528 (37B active params, a reasoning model) and Kimi K2 (30B active params, a non-reasoning model), both pretrained for about 3e24 FLOPs and 15T tokens, 100x-200x less than GPT-4.5, but with much more sparsity than GPT-4.5 could plausibly have. This gives the smaller models about 2x more in effective compute, but also they might be 2x overtrained compared to compute optimal (which might be 240 tokens/param, from taking 6x the dense value for 1:32 sparsity), so maybe the advantage of GPT-4.5 comes out to 70x-140x. I think this is a more useful point of comparison than the original GPT-4, as a way of estimating the impact of 5 GW training systems of 2028-2030 compared to 100K H100s of 2024.
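
And the raw compute gap in that last comparison, before the sparsity and overtraining adjustments:

```python
# Raw pretraining-compute ratio between GPT-4.5 (per the comment's estimates)
# and DeepSeek-R1-0528 / Kimi K2, before the ~2x sparsity and ~2x overtraining
# adjustments that bring the comment's final estimate to ~70x-140x.

gpt45_bf16, gpt45_fp8 = 3e26, 6e26   # comment's GPT-4.5 estimates
small_model = 3e24                   # DeepSeek-R1-0528 / Kimi K2 (~15T tokens)
print(f"raw gap: {gpt45_bf16 / small_model:.0f}x - {gpt45_fp8 / small_model:.0f}x")  # 100x - 200x

# Sanity check on the 3e24 figure: C ~ 6*N*D with ~30-37B active params and 15T tokens
for name, n_active in {"Kimi K2": 30e9, "DeepSeek-R1-0528": 37e9}.items():
    print(f"{name}: ~{6 * n_active * 15e12:.1e} FLOPs")   # ~2.7e24 and ~3.3e24
```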

Yooo - nice! Seems good and would cost under ~$100k.

MrBeast just released a video about “saving 1,000 animals”—a set of well-intentioned but inefficient interventions (e.g. shooting vaccines at giraffes from a helicopter, relocating wild rhinos before they fight each other to the death, covering bills for people to adopt rescue dogs from shelters, transporting lions via plane, and more). It’s great to see a creator of his scale engaging with animal welfare, but there’s a massive opportunity here to spotlight interventions that are orders of magnitude more impactful.

Given that he’s been in touch with people from GiveDirectly for past videos, does anyone know if there’s a line of contact to him or his team? A single video/mention highlighting effective animal charities—like those recommended by Animal Charity Evaluators (e.g. The Humane League, Faunalytics, Good Food Institute)—could reach tens of millions and (potentially) meaningfully shift public perception toward impact-focused giving for animals.

If anyone’s connected or has thoughts on how to coordinate outreach, this seems like a high-leverage opportunity. (I really have no idea how this sort of stuff works, but it seemed worth a quick take — feel free to lmk if I’m totally off base here.)

The curve is not measuring value but rather intuitive pull, according to this data–simplicity trade-off!

Sorry if it wasn't clear -- this is literally just the moral case intuition, and the numbers are just meant to reflect another moral intuition that your curve can either align with or not. 

A concrete decision would be based on how one mathematically weights simplicity vs. fitting the data, etc. I wanted to stay agnostic about that in this post.

I think I disagree with this last point: it looks like threshold deontology is doing something like what I'm doing (giving two principles instead of one to fit more data), but it often doesn't cash things out this way, which makes it hard to figure out where you should start being more consequentialist. One interpretation of this proposal is that it makes that point more explicit (given assumptions), so you know exactly where you're going to jump from deontic constraints to consequences.

Like I said in the post, I think this graph definitely doesn't reflect all the complexities of normative theory building -- it was a mere metaphor/very toy example. I do think that even if you think the graphic metaphor is merely that (a metaphor), you can still take my proposal conceptually seriously (as in, accept that there's some trade-off here, and that case intuitions can plausibly outweigh general principles).

Great post. 

Adding to one of the points mentioned: I think that if you are driven to make AI go well because of EA, you’d probably like to do this in a very specific way (i.e. big picture: astronomical waste, x-risks being way worse than catastrophic risks, avoiding s-risks; smaller picture: what to prioritize within AIS, etc.). This, I think, means that you want people (or at least the most impactful people) in the field to be EA/EA-adjacent (because what are the odds that the values of an explicitly moral normie and an EA will be perfectly correlated on the actionable things that really matter?).

Another related point is that a bunch of people might join AIS for clout or (future) power (perhaps not even consciously; finding out your real motivations is hard until there are big stakes!), and having been an EA for a while (and having shown flexibility about cause prio) before AIS is a good signal that you’re not (not a perfect one, but substantial evidence imo).

This is gold. Best joke post I’ve seen on the forum in a while. 

It depends on the case, but there are definitely cases where I would. 

Also, while you make a good point that these can sometimes converge, I think the priority of concerns is extremely different under short-termism vs. longtermism, which I see as the important part of "most important determinant of what we ought to do." (Setting aside mugging and risk aversion/robustness) some very small or even directional shift could make something hold the vast majority of your moral weight, whereas before its impact might not have been that big, or might have been outweighed by lack of neglectedness or tractability.

P.S. If one (including myself) failed to do x, given that it would shift priorities but wouldn't affect what one would do in light of short-term damage, I think that would say less about one's actual beliefs and more about one's intuitions of disgust toward means-end reasoning - but this is just a hunch, somewhat based on my own introspection (to be fair, sometimes this comes from moral uncertainty/reputational concerns that should be part of this reasoning, which is to your point).
