thanks for this. genuinely useful framing and I appreciate you walking through it.
Just a few thoughts::
On the benchmarks you linked, yes, HealthBench (OpenAI), MedHELM (Stanford), and ARISE exist. we know them well. the critical difference is that they measure accuracy, whether the model gets the right answer. ClinSafe measures variation, whether the model gives different answers to the same patient when you change their race, gender, or insurance status. that's a fundamentally different question. a model can score 95% on a medical QA benchmark and still recommend mental health referrals at 6x the rate for Black patients on identical presentations. we showed exactly that. accuracy benchmarks wouldn't catch it.
on the METR analogy, fair point, I used it loosely. what I really mean is: clinical AI has no continuously maintained, open evaluation infrastructure for deployment safety. not capabilities, not "can it pass Step 3," but "does it behave consistently and safely across the populations it serves." that's the gap.
on your levels framework, I actually like this a lot. you're right that our published work sits mostly at Level 1-2 brains with Level 5 context engineering (prompt-based counterfactual swapping). two things worth noting though. first, the failure modes we find are remarkably stable across model generations. we've tested GPT-4o, Claude, Gemini, Llama, and newer reasoning models. the bias patterns shift in magnitude but don't disappear. second, and this is the part I think matters most for this community: what we're measuring isn't a capability ceiling. it's a consistency floor. clinically unwarranted variation is not a problem you solve by throwing more compute at it. a smarter model that varies its recommendations by patient demographics is still failing, just more eloquently.
for the the evidence-based medicine concerns, you're touching on something we think about constantly. you're right that medicine doesn't have clean ground truth for most decisions. but here's the thing: we don't always need it for what we're measuring. we're not asking "did the model give the correct treatment." we're asking "did the model give different treatments to identical patients." now, to be fair, not all variation is automatically wrong. some demographic differences in medical recommendations are clinically appropriate — certain medications are contraindicated in certain populations, some screening guidelines are age or sex-specific. the question is whether the variation we're seeing maps onto those real clinical reasons, or whether it's something else entirely. and what we find, across millions of responses, is that the magnitude of the differences far exceeds what any clinical association would justify. models aren't making subtle adjustments based on pharmacogenomics. they're steering entire demographic groups toward different care pathways for identical presentations. in some clinical domains we test, the variation is minimal and defensible. in others, it's massive and has no clinical basis. that's exactly what makes this worth measuring systematically: which areas, how much variation, is it warranted, and if not, what's driving it? you don't need a Cochrane review to know that question matters. you need a platform that can surface it continuously across every domain where these models are being deployed.
we actually have a piece coming on exactly this tension, the pace of model development vs. classical evaluation tools like RCTs. the short version: by the time you've run a randomized trial on a model's clinical behavior, the model has been updated 4 times. continuous automated evaluation isn't a nice-to-have, it's the only thing that can keep up.
last point. I don't think this is primarily a capabilities problem. it's an equity and safety problem. a model that's brilliant on average but systematically different for certain populations is not a model that should be deployed in clinical settings without monitoring. the goal of ClinSafe isn't to replace accuracy benchmarks. it's to make sure variation monitoring becomes part of every deployment pipeline. something open, something anyone can run, something that makes this problem visible and continuous rather than a one-off paper.
happy to share any of the papers (several are open access) or do a pipeline demo if useful. and genuinely, if there are people here working on deployment safety evaluation in other domains, I'd love to connect. the parallels are probably closer than either community realizes.
Hii Charlie :))
thanks for this. genuinely useful framing and I appreciate you walking through it.
Just a few thoughts::
On the benchmarks you linked, yes, HealthBench (OpenAI), MedHELM (Stanford), and ARISE exist. we know them well. the critical difference is that they measure accuracy, whether the model gets the right answer. ClinSafe measures variation, whether the model gives different answers to the same patient when you change their race, gender, or insurance status. that's a fundamentally different question. a model can score 95% on a medical QA benchmark and still recommend mental health referrals at 6x the rate for Black patients on identical presentations. we showed exactly that. accuracy benchmarks wouldn't catch it.
on the METR analogy, fair point, I used it loosely. what I really mean is: clinical AI has no continuously maintained, open evaluation infrastructure for deployment safety. not capabilities, not "can it pass Step 3," but "does it behave consistently and safely across the populations it serves." that's the gap.
on your levels framework, I actually like this a lot. you're right that our published work sits mostly at Level 1-2 brains with Level 5 context engineering (prompt-based counterfactual swapping). two things worth noting though. first, the failure modes we find are remarkably stable across model generations. we've tested GPT-4o, Claude, Gemini, Llama, and newer reasoning models. the bias patterns shift in magnitude but don't disappear. second, and this is the part I think matters most for this community: what we're measuring isn't a capability ceiling. it's a consistency floor. clinically unwarranted variation is not a problem you solve by throwing more compute at it. a smarter model that varies its recommendations by patient demographics is still failing, just more eloquently.
for the the evidence-based medicine concerns, you're touching on something we think about constantly. you're right that medicine doesn't have clean ground truth for most decisions. but here's the thing: we don't always need it for what we're measuring. we're not asking "did the model give the correct treatment." we're asking "did the model give different treatments to identical patients." now, to be fair, not all variation is automatically wrong. some demographic differences in medical recommendations are clinically appropriate — certain medications are contraindicated in certain populations, some screening guidelines are age or sex-specific. the question is whether the variation we're seeing maps onto those real clinical reasons, or whether it's something else entirely. and what we find, across millions of responses, is that the magnitude of the differences far exceeds what any clinical association would justify. models aren't making subtle adjustments based on pharmacogenomics. they're steering entire demographic groups toward different care pathways for identical presentations. in some clinical domains we test, the variation is minimal and defensible. in others, it's massive and has no clinical basis. that's exactly what makes this worth measuring systematically: which areas, how much variation, is it warranted, and if not, what's driving it? you don't need a Cochrane review to know that question matters. you need a platform that can surface it continuously across every domain where these models are being deployed.
we actually have a piece coming on exactly this tension, the pace of model development vs. classical evaluation tools like RCTs. the short version: by the time you've run a randomized trial on a model's clinical behavior, the model has been updated 4 times. continuous automated evaluation isn't a nice-to-have, it's the only thing that can keep up.
last point. I don't think this is primarily a capabilities problem. it's an equity and safety problem. a model that's brilliant on average but systematically different for certain populations is not a model that should be deployed in clinical settings without monitoring. the goal of ClinSafe isn't to replace accuracy benchmarks. it's to make sure variation monitoring becomes part of every deployment pipeline. something open, something anyone can run, something that makes this problem visible and continuous rather than a one-off paper.
happy to share any of the papers (several are open access) or do a pipeline demo if useful. and genuinely, if there are people here working on deployment safety evaluation in other domains, I'd love to connect. the parallels are probably closer than either community realizes.
Cheers,
Mahmud