
In February I got in touch with CaML as part of the AIxAnimals incubator, run by Sentient Futures. They tasked me with putting MORU Bench (Moral Reasoning Under Uncertainty) up on Inspect, a benchmarking service run by the UK's AI Security Institute. My PR was accepted and I completed the project within three weeks. It was my first project working with Claude Code.
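For readers unfamiliar with eval frameworks, putting a benchmark onto one mostly means expressing it as three pieces: a dataset of samples, a solver that produces model output, and a scorer that grades it. The sketch below shows that shape in plain Python with hypothetical names (`Sample`, `run_eval`, and the toy solver/scorer are illustrative only, not Inspect's actual API):

```python
from dataclasses import dataclass


@dataclass
class Sample:
    input: str   # prompt given to the model
    target: str  # reference answer used for scoring


def run_eval(samples, solver, scorer):
    """Run each sample through the solver and average the scores."""
    results = [scorer(solver(s.input), s.target) for s in samples]
    return sum(results) / len(results)


# Toy stand-ins: a fake "model" with canned answers, and an exact-match scorer.
samples = [Sample("2+2=", "4"), Sample("Capital of France?", "Paris")]
solver = lambda prompt: "4" if "2+2" in prompt else "Paris"
scorer = lambda output, target: 1.0 if output == target else 0.0

print(run_eval(samples, solver, scorer))  # 1.0 (both answers match)
```

A real Inspect eval swaps the toy solver for actual model calls and the exact-match scorer for something appropriate to the task, but the overall structure stays this simple.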

It felt surprisingly familiar despite being Python rather than Swift: building tooling, writing to a spec, making sure test coverage is solid. Of all the things I’ve done during this career switch, this is probably the closest to what I actually did as an iOS framework engineer.

A few things stood out to me:

  • The Inspect team had an agents.md setup that could run workflows for checking whether your benchmark was ready for a pull request, whether it had sufficient test coverage, and for generating a report from your results. These files are incredibly useful when onboarding people to an unfamiliar project. When you’re new, you don’t know what to prompt for because you don’t know what’s there; you don’t know the guidelines or the architecture. Having an agents.md automates all of that.
  • Having a fairly thorough agent-driven PR review before a human looks at it seems like a good process to adopt. I know a lot of people in the industry who are struggling with the volume of pull requests coming through because of AI coding tools. This feels like a reasonable way to enforce some quality before a PR hits a human reviewer; human review alone isn’t sustainable at scale if people are just vibe-coding their PRs.
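For context, an agents.md of the kind described above is just a markdown file of instructions and workflows that coding agents read when they start working in a repository. A purely hypothetical sketch (not the Inspect team's actual file) might look like:

```markdown
# AGENTS.md

## Project overview
Repository of evaluations built on the Inspect framework.
Each eval lives in its own directory with a task definition and tests.

## Workflow: PR readiness check
1. Run the full test suite and confirm the new eval has test coverage.
2. Check the eval follows the repository's directory and naming conventions.
3. Verify the README documents the dataset, scorer, and known limitations.

## Workflow: results report
1. Load the logs from the latest eval run.
2. Summarise scores per task and flag any regressions against the baseline.
```

The value is less in any individual instruction than in encoding the project's conventions somewhere an agent (or a new contributor) will reliably find them.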

Is this impactful work?

I’m a bit torn. On one hand, what CaML are doing (creating benchmarks that measure how an AI views animal welfare, how compassionate it is to non-human beings) seems like a great way to influence model behaviour. These benchmarks are used by frontier labs as targets to hit before releasing a model, so they have a pretty direct effect on how models will end up behaving.

On the other hand, I’m apprehensive about benchmarks in general. During my time at the EA hotel I spoke to a few AI safety people who saw most evaluations as progressing capabilities, not safety. The logic is that an evaluation measuring frontier maths, or any other capability, ends up helping the labs make their agents better at that capability. I don’t feel experienced enough to hold a strong opinion on this myself; I’m just aware that opinions differ. I’ve been encouraging the people I know who feel strongly about this to write about it, because I’d love to see their views stress-tested.

Fit

Of all the fit tests so far, this has been the most promising in terms of how capable I am at it. It maps nicely onto my existing experience, and I did enjoy doing it: maybe not quite as much as iOS development, but more than any other project I’ve worked on during this process.

I also have to bear in mind that creating an evaluation is probably one of the easier tasks; I can see there being more interesting work in maintaining a framework rather than using it to create benchmarks. It’s also an area where I’m in a much better position to hit the ground running and get a job.

A couple of concerns though:

  • In the case of Inspect, it would likely mean moving to London. Not sure that’s my cup of tea.
  • Long-term viability. This is a form of coding where you’re not doing any ML, and it’s already highly automatable with Claude Code. I actually met with a former manager at AISI who’s attempting to automate benchmarks. That said, it could be interesting to be the one building those automations.

All in all, it’s a strong contender so far and a successful fit test. I may go back and do a few additional PRs for Inspect, or look into other evaluation projects, to revisit this.

Comments

Thanks for the post! For those reading, I'm Jay - head of technology and standards for the Inspect Evals repo, and the reviewer of Declan's PR. I happened to spot this post without realising it was from a recent contributor! A couple of quick clarifications around the structure of Inspect Evals (it is pretty confusing).

Inspect Evals != Inspect, and isn't run by the same team. Inspect is the evals framework, Inspect Evals is a repository of evals that use that framework. Inspect Evals is run by Arcadia Impact, and we're contracted by the UK AISI to maintain it. 

Our developers work remotely as contractors, so moving to London isn't required. I'm in Australia at the moment. (though we're doing a restructure atm and it's uncertain how that's going to pan out, so I'm not sure about our current hiring)

I think the EA Hotel people have a point about evaluations, personally. I think if you're going to open-source evaluations, you should ask "Would I be okay if frontier AI companies trained on this / hill-climbed on this metric?" For frontier maths evals you might not want this - is it worth the increased knowledge we get about these capabilities? For moral reasoning under uncertainty, you may actively want them to do something like this.

Finally, I'm glad you liked the agent stuff - that's been a majority of where my time's gone this quarter. Appreciate the feedback, and more is always welcome :)
