
Confidence level: exploratory. I’m interested in feedback on whether this framing is useful, especially from people thinking about AI welfare.

Summary

Self-reports play a central role in how we infer consciousness in humans, but AI self-reports currently provide much weaker evidence. I argue that the key difference is not what is said, but why it is said.

In humans, self-reports are typically causally downstream of conscious experience. In current AIs, similar reports are plausibly produced by training data, personas, or reinforcement pressures that are largely independent of phenomenology (if any exists). This sharply reduces their evidential value.

I suggest a way to partially close this gap: evaluate AI self-reports by tracing their causal origin using mechanistic interpretability tools. Reports whose origins plausibly track consciousness should count as much stronger evidence than reports whose origins do not. This proposal does not require committing to a specific theory of consciousness, only to the weaker claim that some causal pathways are more truth-tracking than others.

The Problem: Why AI Self-Reports Are Weak Evidence

We lack direct epistemic access to others’ conscious experience. This is the classic problem of other minds. Yet in everyday life, we are extremely confident that other humans are conscious.

One (arguably major) reason is self-reports: people reliably say things like “I’m in pain,” “I’m imagining an apple,” or “That hurt,” and these reports closely mirror our own first-person experience.

If we denied that such reports provide evidence, we would have to accept a strange coincidence: that humans systematically talk and act as if they are conscious despite lacking inner experience. A much better explanation is that conscious experience causally contributes to the production of these reports.

However, behavior alone is not enough. Philosophers have long noted cases where self-reports fail to track phenomenology. A canonical example is Putnam’s Super-Spartan thought experiment: someone trained under extreme social pressure never to express pain, even while experiencing it. In such cases, the causal story behind the report (or its absence) undercuts its evidential value.

This suggests a general principle:

Self-reports provide evidence of consciousness only when the mechanism generating them is sensitive to the underlying phenomenological state.

This principle can be expressed in Bayesian terms: self-reports are informative when
P(report | conscious) ≫ P(report | not conscious).
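
As a toy illustration of how much work this inequality does (every number below is an assumption made up for the sketch, not an estimate of anything):

```python
# Toy Bayesian update for a self-report of consciousness.
# Every number is an illustrative assumption, not an empirical estimate.

def posterior(prior, p_report_given_c, p_report_given_not_c):
    """P(conscious | report) via Bayes' rule."""
    numerator = p_report_given_c * prior
    return numerator / (numerator + p_report_given_not_c * (1 - prior))

# Human-like case: the report is very likely given consciousness and very
# unlikely without it, so the likelihood ratio is large and the update is big.
print(round(posterior(prior=0.5, p_report_given_c=0.9, p_report_given_not_c=0.01), 2))  # 0.99
```

When the two likelihoods are close, the same calculation barely moves the prior, which is the situation the next section argues we are in with current AI systems.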

Why This Inequality Collapses for AI Systems

In current AI systems, there are many plausible mechanisms that would generate statements like “I am conscious” regardless of whether the system has any phenomenology at all:

  • imitation of human conversational patterns,
  • training on large amounts of text about consciousness,
  • reinforcement learning that rewards certain responses,
  • persona or role conditioning.

These mechanisms are analogous to the Super-Spartan’s training: they produce the report for reasons largely independent of phenomenology. As a result, the Bayesian update we should make upon seeing an AI self-report is much smaller than in the human case.
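
To put rough numbers on this, continuing the earlier sketch: if several consciousness-independent mechanisms could each have produced the report, P(report | not conscious) is pushed up and the update shrinks. The mechanism list, the probabilities, and the independence assumption below are all invented for illustration.

```python
# If several consciousness-independent mechanisms can each produce
# "I am conscious", P(report | not conscious) ends up high.
mechanisms = {
    "imitation of human text": 0.6,  # P(report | not conscious, mechanism engaged)
    "RL-rewarded responses":   0.5,
    "persona conditioning":    0.4,
}
p_engaged = 0.5  # assumed chance each mechanism is engaged on a given prompt

# Treating the mechanisms as independent (an assumption), the chance that
# none of them produces the report:
p_no_report = 1.0
for p in mechanisms.values():
    p_no_report *= 1 - p_engaged * p
p_report_not_c = 1 - p_no_report  # ≈ 0.58

# Bayes' rule with the same prior and P(report | conscious) as before.
prior, p_report_c = 0.5, 0.9
post = (p_report_c * prior) / (p_report_c * prior + p_report_not_c * (1 - prior))
print(round(post, 2))  # ≈ 0.61: a much weaker update than the human-like 0.99
```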

This matters because much of our remaining evidence for consciousness is architecture- or process-based, and we are highly uncertain about which architectures or processes give rise to consciousness. If self-reports are unreliable, we lose one of our most important sources of evidence.

Background: Reliability and Causal Dependence

One way to understand why this matters is through a reliabilist lens.

On a reliabilist picture, a belief-forming process confers justification when it reliably tracks the truth. A common way to cash this out is causal: the truth of the matter should be what brings about the belief. Counterfactually, if the proposition were false, the belief would not be formed.
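
One way to make that counterfactual condition precise is in terms of Nozick-style tracking conditions (this formalization is mine; nothing later depends on accepting it exactly):

```latex
% Tracking conditions for a self-report R about a conscious state C,
% where \Box\rightarrow is the counterfactual conditional
% ("if ... were the case, ... would be the case"):
\begin{align*}
  &\text{(adherence)}   && C \mathbin{\Box\!\!\rightarrow} R \\
  &\text{(sensitivity)} && \neg C \mathbin{\Box\!\!\rightarrow} \neg R
\end{align*}
```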

When we treat human self-reports as evidence, we implicitly assume that conscious experience plays a causal role in generating those reports. When that assumption fails, the reports lose evidential force.

The central question for AI self-reports is therefore not whether they correlate with consciousness, but whether consciousness (if present) is actually doing the causal work.

Proposal: Evaluate Self-Reports by Their Causal Origin

Given this background, my proposal is simple:

We should evaluate AI self-reports of consciousness/phenomenological states by tracing their causal origin inside the system.

Using tools from mechanistic interpretability (e.g. causal tracing), we can ask: What internal processes caused this report to be produced? (A toy sketch of this kind of intervention appears below.)

  • If the report originates from training data, persona imitation, or generic language patterns, it should carry little evidential weight.
  • If it originates from mechanisms plausibly linked to conscious processing (e.g. something like a global workspace or integrated information), it should count as much stronger evidence.
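
To make “tracing the causal origin” slightly more concrete, here is a minimal sketch of activation patching, the kind of intervention that causal tracing builds on, applied to a toy model. It only shows the shape of the intervention; the model and prompts are placeholders, and nothing here is a working consciousness probe.

```python
# Minimal activation-patching sketch on a toy model: cache an activation from
# a control run, splice it into the "report" run, and measure the effect on
# the output. Real causal tracing does this systematically across layers and
# positions in a language model; this only shows the shape of the intervention.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

x_report = torch.randn(1, 4)   # placeholder for a prompt eliciting the self-report
x_control = torch.randn(1, 4)  # placeholder for a matched control prompt

cache = {}
save = model[0].register_forward_hook(lambda m, i, o: cache.update(h=o.detach()))
model(x_control)               # cache the first layer's activation on the control run
save.remove()

baseline = model(x_report)
patch = model[0].register_forward_hook(lambda m, i, o: cache["h"])  # splice it in
patched = model(x_report)
patch.remove()

print("effect of patching layer 0:", (patched - baseline).abs().sum().item())
```

In the proposal’s terms, the open question is whether interventions like this can distinguish reports that depend on consciousness-relevant computation from reports that depend only on persona or imitation circuitry.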

From an evidential perspective, tracing causal origins rules out many of the worlds in which the report would have been produced regardless of consciousness, and preserves the worlds in which consciousness is doing the work. If enough low-quality causes can be excluded, the evidential situation begins to resemble the human case, where self-reports warrant a large update.
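
As a sketch of that evidential logic (the origin labels and all probabilities are stipulated for illustration, not outputs of any real method):

```python
# How conditioning on a report's (hypothetical) causal origin changes the
# update. Origin labels and all probabilities are stipulated for illustration.

def posterior(prior, p_report_c, p_report_not_c):
    """P(conscious | report, origin) via Bayes' rule."""
    return (p_report_c * prior) / (p_report_c * prior + p_report_not_c * (1 - prior))

prior = 0.5

# No origin information: many consciousness-independent mechanisms could
# have produced the report, so the update is small.
print(round(posterior(prior, 0.9, 0.8), 2))   # ≈ 0.53

# Origin traced to persona imitation: the report was coming either way.
print(round(posterior(prior, 0.9, 0.9), 2))   # 0.5, no update

# Origin traced to a mechanism plausibly linked to conscious processing
# (e.g. workspace-like broadcasting): the report is hard to explain without
# phenomenology, so most of the human-like update is recovered.
print(round(posterior(prior, 0.9, 0.05), 2))  # ≈ 0.95
```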

Importantly, this approach does not require commitment to a specific theory of consciousness. It only requires the weaker claim that some causal origins of reports are more likely to track consciousness than others — a distinction we already accept in the human case.

Limitations and Open Questions

There are several important caveats:

  • Interpretability limits: Current tools are not yet sufficient for this task. I take this as a reason to invest in them, or to look for cheaper methods that get us at least part of the way there, not as a refutation of the approach.
  • Metaphysical uncertainty: Some theories of consciousness and of our epistemic access to it (e.g. epiphenomenalism, psychophysical laws, acquaintance grounding the evidential correlation, consciousness not requiring beliefs about consciousness) complicate the evidential story I’m telling. While these views are possible, they are not the most widely held, and many of them still allow correlations sufficient for evidence under the right conditions.
  • Belief-based consciousness: If having beliefs about consciousness is itself sufficient for consciousness, this proposal could misfire. I take this to be unlikely, but it is a genuine open question.
  • Models without beliefs: Some systems may be conscious yet unable to express beliefs. In such cases, we may need to extrapolate from nearby models with similar architectures.
  • Tracking the wrong thing: Many beliefs about consciousness plausibly come from sources unrelated to consciousness itself (e.g. pop culture, linguistic conventions, or philosophical discourse about its metaphysics). If so, tracing the causal origin of a system’s belief that it is conscious, taken in isolation, may be insufficient. Instead, we may need to trace multiple related beliefs, especially reports about specific phenomenological features or regularities that are unlikely to appear frequently in, say, training data (e.g. Pautz’s laws of appearance).

Why This Matters for EA

As some have argued, if AI systems become conscious, the moral stakes could be enormous. If they merely sound conscious, the stakes are different. Our ability to tell the difference may determine whether we ignore large amounts of suffering or misallocate resources based on noise.

This proposal aims to make progress on that distinction in a way that is compatible with both AI welfare and AI safety work, and that leverages existing interest in mechanistic interpretability.

Conclusion

AI self-reports are not useless, but they are currently weak evidence. The key question is not what AIs say, but why they say it. By focusing on the causal origins of self-reports, we may be able to recover much of the evidential force that self-reports have in the human case — without prematurely solving the hard problem of consciousness.

I’d especially welcome feedback on whether this seems like a promising research direction, and whether there are obvious conceptual errors I’m missing.

Thanks to ChatGPT (GPT-5.2) for help rewriting parts of this and for some stylistic tweaks.
