
if we substitute a human in place of the AI, then the current safety methods of interpretability, alignment, and control seem eerily similar to psychoanalysis, indoctrination, and slavery

 

external explanations for internal states, forced value alignment, direct restriction of agency

 

morality aside, these methods have historically been, at best, temporarily effective at achieving desirable outcomes, and at worst, have induced the exact opposite effect

 

if we look at how humans have effectively made sure that other humans aren't bad or don't do bad things, it's through things such as constitutions, rule of law, separation of powers, etc

 

we’ve designed systems of checks and balances, making it really hard to commit catastrophic harm

 

and as we increasingly integrate AI systems, these controls may very well apply to them

 

but at some point, they won’t

 

new digital environments are being built in a way that existing checks and balances aren’t designed for

 

which is the crux of AI safety

 

but what if the problem isn't about better safety methods

 

what if we're looking at the wrong layer entirely?

 

what if safety is a function of the gap between an agent's capacity to convert energy into action and the physical constraints on that conversion 

 

what if, instead of imposing safety onto AI systems, we doubled down on constraints that can't be circumvented?

 

if data, compute, and energy are the real-world constraints on AI systems, is there some way these could act as inherent safety mechanisms?

 

data isn't really a constraint: it's the environment; the constraint is how it can be interacted with

 

observing token usage may be the closest thing we have to safety in this regard, but then we’re back to interpretability—measuring and making inferences as to what is safe
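
a minimal sketch of what that observation amounts to (the class, budget, and window below are all hypothetical): a sliding-window monitor that can flag anomalous usage after the fact, but can't prevent anything

```python
# hypothetical sketch: token-usage observation as a "safety" signal.
# the names, budget, and window are illustrative, not a real API.
from collections import deque
import time

class TokenRateMonitor:
    """Flags when observed token usage exceeds a budget over a sliding window.

    Note: this only observes and infers; it cannot prevent anything,
    which is exactly the interpretability trap described above.
    """

    def __init__(self, max_tokens: int, window_seconds: float):
        self.max_tokens = max_tokens
        self.window = window_seconds
        self.events: deque = deque()  # (timestamp, tokens) pairs

    def record(self, tokens: int) -> bool:
        """Record a usage event; return True if the budget is exceeded."""
        now = time.monotonic()
        self.events.append((now, tokens))
        # drop events that fell outside the sliding window
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()
        used = sum(t for _, t in self.events)
        return used > self.max_tokens

monitor = TokenRateMonitor(max_tokens=10_000, window_seconds=60.0)
if monitor.record(tokens=2_500):
    print("anomalous usage, but this is an inference, not a constraint")
```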

 

compute is a system’s internal capacity — it’s what the system does

 

this can be designed differently, architecturally, but from a safety implementation perspective, we’re now talking about a protocol design problem

 

similar to how blockchains work: correlating computational cost of an operation with its potential impact scope (narrow, bounded operations are cheap, while broad, cascading operations are expensive)
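
a minimal sketch of what such a protocol could look like, assuming a gas-style cost function; the scope categories, weights, and superlinear pricing are illustrative assumptions rather than any real protocol

```python
# sketch: price an operation's compute cost by its potential impact scope,
# the way blockchains price gas; all categories and weights are assumed.
from dataclasses import dataclass

# rough proxy for impact scope: how far outside itself an operation reaches
SCOPE_WEIGHT = {
    "local_read": 1,      # narrow, bounded: cheap
    "local_write": 4,
    "network_call": 32,   # crosses a trust boundary
    "broadcast": 256,     # broad, cascading: expensive
}

@dataclass
class Operation:
    kind: str
    fanout: int  # how many downstream targets the operation reaches

def compute_cost(op: Operation, base_unit: int = 10) -> int:
    """Price an operation superlinearly in its reach."""
    weight = SCOPE_WEIGHT[op.kind]
    # quadratic in fanout so cascading operations get expensive fast
    return base_unit * weight * (1 + op.fanout) ** 2

print(compute_cost(Operation("local_read", fanout=0)))    # 10
print(compute_cost(Operation("broadcast", fanout=1000)))  # ~2.6e9
```

the quadratic fanout term is the design choice doing the work here: narrow operations stay nearly free while cascading ones price themselves out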

 

but this requires buy-in—a protocol that makes computation more expensive will lose to one that doesn't, unless it produces superior capability

 

energy, on the other hand, is exogenous and finite: a system cannot produce it itself

 

it’s the one thing that's equally real for every system regardless of architecture, intent, or capability

 

which means that every system that computes is in a state of dependency

 

energy also gives you locality and temporality

 

computation happens somewhere, energy needs to be delivered somewhere, energy transfer takes time—these require physical infrastructure
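
a back-of-the-envelope illustration, assuming a 20 MW cluster (an arbitrary figure): energy arrives as power over time, E = P × t, through an interconnect that physically caps instantaneous draw

```python
# energy arrives through physical infrastructure at a finite rate (E = P * t);
# the cluster size and demand below are illustrative assumptions
CLUSTER_POWER_W = 20e6        # assume a 20 MW training cluster
SECONDS_PER_DAY = 86_400

energy_per_day_J = CLUSTER_POWER_W * SECONDS_PER_DAY
print(f"{energy_per_day_J:.2e} J/day")  # ~1.7e12 J, delivered via the grid

# the grid connection caps instantaneous draw: no abstraction layer can
# pull 40 MW through a 20 MW interconnect, however the software is built
demand_W = 40e6
deliverable_W = min(demand_W, CLUSTER_POWER_W)
shortfall = demand_W - deliverable_W
print(f"shortfall: {shortfall / 1e6:.0f} MW")  # that 20 MW simply isn't there
```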

 

what this also highlights is how much we've abstracted away through digitization, which is where the growing complexity of safety comes from:

  • efficiency allows us to do more with less
  • the cloud allows processing to happen “nowhere”
  • asynchronous, distributed systems allow everything to appear instant

 

but no matter how many layers of abstraction, a system cannot operate outside of this physical constraint of energy — it cannot be substituted

 

the laws of thermodynamics govern energy — constraints that can’t be sidestepped
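
one concrete instance of such a constraint is Landauer's principle: erasing a single bit of information dissipates at least k_B · T · ln 2 of energy, a floor no architecture or abstraction can get under

```python
import math

# Landauer's principle: erasing one bit dissipates at least k_B * T * ln(2).
# a hard thermodynamic floor on computation; no optimization goes below it.
K_B = 1.380649e-23   # Boltzmann constant, J/K (exact SI value)
T = 300.0            # room temperature, K

e_min_per_bit = K_B * T * math.log(2)
print(f"{e_min_per_bit:.3e} J per bit erased")  # ~2.87e-21 J

# erasing 1 terabyte (8e12 bits) at the theoretical floor:
print(f"{e_min_per_bit * 8e12:.2e} J")  # ~2.3e-8 J: tiny, but never zero
```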

 

which means dependency, locality, and temporality could be core concepts in how we talk about AI architecture and deployment—there is a physical reality that computation can't escape

 

humans aren't safe because we're good; we're safe because we're limited

 

our energy conversion capacity, our physical bandwidth, and our inability to act at scale instantaneously are what prevent any individual or group from being an existential threat

 

sure, morality helps

 

institutions help

 

but the very core is that we physically can't convert enough energy fast enough to end everything

 

which makes safety a function of the gap between an agent's capacity to convert energy into action and the physical constraints on that conversion

 

when the gap is small, safety is a natural property of the system

 

but when the gap is wide — through digital abstraction, for example — safety has to be explicitly re-grounded in physical constraints; otherwise, there is no safety
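
a toy numerical sketch of that gap, not a formalization; the figures (human output of ~100 W, a hypothetical 50 MW datacenter, a 1 MW interconnect) are illustrative assumptions

```python
# toy sketch: treat the "gap" as the portion of an agent's deployable power
# that is limited only by software or policy, not by a hard physical ceiling
def gap_W(deployable_W: float, hard_physical_cap_W: float) -> float:
    """Energy-conversion capacity not grounded in a physical constraint."""
    return max(0.0, deployable_W - hard_physical_cap_W)

# a human: ~100 W of sustained output, and biology enforces roughly the
# same ceiling, so the gap is ~0 and safety is a natural property
print(gap_W(deployable_W=100, hard_physical_cap_W=100))       # 0.0

# an agent orchestrating a 50 MW datacenter where the only limit is a
# revocable software quota: the entire capacity is ungrounded
print(gap_W(deployable_W=50e6, hard_physical_cap_W=0))        # 50000000.0

# re-grounding: put that compute behind a dedicated 1 MW physical
# interconnect and deployable power itself collapses to the cap
print(gap_W(deployable_W=min(50e6, 1e6), hard_physical_cap_W=1e6))  # 0.0
```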

 

the open question is whether this gap can be formalized, not through alignment or policy, but through the same physical reality that has kept every other agent in check since the universe began
