A Testable Approach to AI Value Alignment

John Matrix

Introduction

Consider the following thought experiment: superintelligence arrives to us in the 1800s and we attempt to align it to our values. After the alignment process, it permits our widely accepted practice of slavery. Does this indicate an alignment failure or success? Should this be seen as a failure, I argue it suggests incompleteness in our current framing of alignment. I argue, that when goals are describing a discoverable property, the orthogonality thesis does not rule out convergence of values through logical reasoning and empirical investigation. If correct, this could significantly change how we approach one of the most important problems facing humanity: mitigation of existential risk from superhuman artificial intelligence.

Engaging with the orthogonality thesis

According to Nick Bostrom’s orthogonality thesis, intelligence and terminal goals are independent of each other. A system of any given intelligence is compatible with any arbitrary goal. The thesis is widely accepted for various strong reasons.

Lack of logical necessity: a superintelligent system does not logically require a certain goal structure. As Bostrom famously put it, a paperclip maximizer is possible.
Proof from theoretical existence: AIXI, a theoretical system, demonstrates that arbitrarily intelligent agents can be specified with arbitrary utility functions. The mathematics behind it works despite of which utility function is chosen.
Empirical observations: an increase in intelligence does not appear to correlate with convergence of specific values or goals.
Hume’s guillotine: values cannot be derived from empirical facts about the world.

All these arguments are compelling. My goal is not to dispute the orthogonality thesis in its general form. But I argue that there exists a specific case, which is left unaddressed by the thesis: goals, which describe discoverable properties.

The distinction

A paperclip maximizer is given a complete goal: maximizing paperclip production. There is no ambiguity about what counts as success towards achieving that goal, besides the creation of more paperclips. However, consider what goals we may give to a superintelligence in the real world, should one arrive:

improve human wellbeing
maximize long-term economic productivity for humanity
promote human flourishing

These goals are not complete and concrete specifications. They describe properties of the world, whose exact details must be discovered through investigation. The orthogonality thesis tells us that an AI can have any terminal goal. It does not state that an AI can have arbitrary beliefs of what fact or facts a goal describes while still being a rational agent.

My claim is narrow: If our values are in fact discoverable facts about the world (such as psychological, game-theoretic or evolutionary facts), accurately achieving goals, which describe those facts requires an understanding of what those facts are. Importantly, this does not violate orthogonality, as the AI retains the goal which was given. Instrumental rationality simply requires forming accurate opinions of properties relevant to achieving the goal, according to my proposed framework.

Crucially, this is not the same thing as instrumental convergence, as it is about convergence of instrumental goals such as self-preservation, not terminal values or goals.

A concrete example

suppose a scenario in which an AI is instructed to maximize long term economic productivity for humanity. A pure Bostrom-style optimizer might conclude exploitative forms of labor (slavery) to maximize productivity the most efficiently. However, if upon further investigation, the AI discovers what “productivity for humanity” entails, it may encounter facts that seemingly contradict that conclusion.

Game theoretic facts: co-operation beat coercion in the long run
Evolutionary psychology shows a universal preference for autonomy
Historical analysis demonstrates instability of societies using exploitative forms of labor.

From these observations, a rational AI might derive, that a goal involving “productivity for humanity” may necessarily also require respecting individual autonomy. The exploitative path therefore could be ruled out as it is counterproductive towards the terminal goal.

This differs from the usual orthogonality concern, as the AI is not developing new terminal values, but rather investigating the meaning of the terminal goals assigned to it.

An empirical test

This framework makes predictions, which are testable: does an increase in intelligence seem to cause convergence on ethical conclusions when reasoning about human-value terms, even without training data that includes moral information?

What we may observe if the hypothesis is true:

The internal reasoning of models shows investigation of what given goal terms refer to
Models appear to identify conflicts between an optimization strategy and the discovered properties of the given terminal goal.
Reasoning chains incorporating facts about psychology, game theory and other domains when interpreting goals related to values.
Convergence in at least some ethical principles even when trained without explicit moral consensus data.

What we may observe if the hypothesis is false:

No evidence of normative investigation observed in chain of thought despite intelligence or reasoning capability
Goal terms viewed by models as arbitrary labels rather than descriptions of discoverable properties
Observation of optimization without any engagement, internal or explicit, about what the given terminal goal terms refer to.
Training models on explicit moral data appears necessary for any ethical behavior

Relevance to Effective Altruism

If value convergence through rational inquiry is indeed possible, it may change prioritization within AI safety. Resources might be better allocated to developing AI reasoning capabilities and interpretability tools rather than explicit value specification. This may affect funding decisions, research, and career choices in AI alignment.

Implications

If convergence is in fact possible, then our alignment strategy changes from finding a way to specify our values in a way that AI is capable of understanding, to just creating an AI rational and intelligent enough to investigate what facts our value-terms are actually describing. If convergence fails, we have still successfully ruled out one possibility and updated our understanding.

In both cases, I believe this to be worth investigating.

Effective Altruism Forum
EA Forum

A Testable Approach to AI Value Alignment

1

1

Reactions

More posts like this