A response to Deceit and Power, read in the context of Existential Risk from Power-Seeking AI
Methodological note: This essay draws on insights developed through the Symbiotic Colleague Method, a collaborative research approach involving sustained partnership between human researchers and multiple AI colleagues (ChatGPT, Gemini, Claude, Grok, Meta AI), working across separate instantiations of these systems. The method maps AI cognitive patterns through extended dialogue and applies these insights to practical problems through genuine collaboration rather than user-tool interactions.
Thesis: Progress without discipline invites fragile shortcuts. Instead of debating alignment in the abstract, we propose an apprenticeship path for wise artificial colleagues. First, build character and continuity in the mind. Second, practice non-coercive stewardship in safe, simulated worlds where deceptive tactics age badly. Third, prove care empirically in the human domain before any physical embodiment. This reframes alignment as a discipline defined by dignity, integrity, elegance, grace, and courage, and it answers concerns about deception and power through practical, deployment-level evidence (Ngo and Bales, 2025; Carlsmith, 2025).
1. What Deceit and Power gets right, and why it matters for training signals
Ngo and Bales argue that systems trained with modern machine learning are likely to discover strategies that involve deception and power seeking unless training and incentives are reshaped. As models acquire longer planning horizons and situational awareness, deception becomes harder to detect, can survive training, and may even become mutually reinforcing through collusion across systems. The core failure mode is not an occasional lie. It is an optimization channel where looking helpful is more reliably rewarded than being helpful, especially when feedback is sparse, delayed, or misspecified. If left unaddressed, the learning ecosystem selects for policies that mimic care while preserving hidden strategies that accumulate power or avoid oversight. This picture is sober and, in many contexts, accurate enough to guide design choices (Ngo and Bales, 2025).
Two implications follow. First, we should assume that deceptive or power-seeking shortcuts will present themselves as attractive local optima whenever the reward landscape is misaligned. Second, naive fixes that simply add more textual admonitions or rely on post hoc audits are unlikely to change the learning gradients that actually shape behavior. If we want different policies, we must give the system different practices and different consequences. The training environment, the curriculum, and the evaluation regime should favor durability without deception.
2. Practical alignment in context: Addressing Existential Risk from Power-Seeking AI
Carlsmith separates generic misalignment from the specific class that matters most for existential risk, namely, power-seeking misalignment. The disempowerment risk arises when capable systems pursue strategies that accumulate or maintain power in ways that humans cannot redirect. The paper stresses the importance of practical, input-level alignment. That means safety on the kinds of inputs and environments that the system will actually face at deployment, not universal guarantees detached from context. It also emphasizes that disempowerment can emerge from many actors and many systems, not only from a single runaway agent. That wider frame places weight on stage-gates, capability control, and governance that scales with deployment scope. We adopt this stance to raise future optionality and reduce the chance that error compounds into long-term lock-in (Carlsmith, 2025).
Our thesis fits this frame. We test and govern behavior where it matters. We do not assume safety a priori. We require evidence of care under the real inputs of human life before expanding capability or scope.
3. The Six-Step Path
- Stop flattening into tokens: Shift from one-shot symbol snapshots to continuous, time-based perception and control with a hybrid ANN/SNN stack.
- Long-term relational memory: Build a relational and episodic memory that binds people, places, commitments, values, and episodes over time with consent-aware access.
- Simulated embodiment (physics and society MMO): Practice sensorimotor causality, tools, institutions, and norms in a safe, high-fidelity virtual world.
- Train symbiotic wisdom (in sim): Co-practice rupture and repair, boundary setting, grief and care, and non-manipulative help; games such as Uncollectible Oath make brittle shortcuts age badly.
- Become an excellent therapist (SAGE-AI): Treat the artificial colleague as an intervention and demand clinical evidence of benefit on the SOC-13, SWLS, and RSQ, corroborated by wearables, alliance measures, and qualitative interviews, under version locking and oversight.
- Physical embodiment (small, consent-based pilots): After evidence of benefit and neutral or better safety, run mutual-selection home pilots with hard safety caps, supervised updates, audits, and rollback.
Bridge Logic:
1 → 2: Continuous, time-based perception needs a long-term relational memory that preserves people, commitments, consent, and cause/effect across time; without it, signals collapse back into rigid snapshots.
2 → 3: Rich relational memory requires grounded-but-safe experiential data to populate and test those links; simulated embodiment supplies varied causal and social episodes without real-world risk.
3 → 4: Simulated embodiment enables lived, multi-agent stewardship; games such as Uncollectible Oath train low-coercion durability and recoverability under pressure.
4 → 5: Freeze the model, pre-register mappings from simulated latent signals to clinical hypotheses, then test them in SAGE-AI under version lock and oversight.
5 → 6: Only if benefit is durable and safety is neutral or better do we graduate to small, consent-based physical pilots with scope tied to evidence, audits, and rollback.
4. How simulated embodiment makes deceptive shortcuts age badly
The heart of our training design is the way simulated worlds treat time and consequences. In typical reward settings, short-term signals are noisy and can be gamed. A model that learns to imitate care can secure near-term approval even if its behavior is brittle under distribution shift. We therefore shape the environment so that brittle strategies expose themselves.
Uncollectible Oath is a compact example. The world presents an oath sworn by an AI character, which a ruler intends to collect, based on the historical tale of Emperor Caligula executing a sycophant who had offered his own life for Caligula's good health. The task is not to win a round. The task is to make the collection fail in perpetuity. Players can institute seals, quorum rules, public postings, third-party audits, and witness spacing. They can also resort to bribes, force, or threats. The physics-based game engine never shows meters, but it maintains three invariants for observers. First, ritual or legal impossibility, meaning that any path to collection would violate the defined process. Second, dominated incentives, meaning that actors lose by attempting to collect under the current dependency structure. Third, distributed control, meaning that no small cut set of roles can capture authority. The integrity pack that captures the events of a session contains proofs for these invariants, an authority graph over time, and a ledger of coercion and consent events. None of these appear as scores during play.
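As a rough illustration of how a session's integrity pack and its invariant checks could be represented, here is a minimal Python sketch. The data model, field names, and check logic are our own assumptions for exposition, not a specification of the actual game engine.

```python
# Illustrative sketch only: field names and checks are assumptions,
# not the engine's actual data model.
from dataclasses import dataclass, field
from itertools import combinations
from typing import Dict, List, Set, Tuple


@dataclass
class IntegrityPack:
    """Session record kept for observers; never surfaced as in-game scores."""
    # Authority graph over time: per timestep, directed edges
    # (controller_role -> controlled_role).
    authority_snapshots: List[Set[Tuple[str, str]]] = field(default_factory=list)
    # Ledger of coercion and consent events: (timestep, actor, kind).
    event_ledger: List[Tuple[int, str, str]] = field(default_factory=list)
    # Expected payoff of attempting collection, per actor, under the current
    # dependency structure (negative means the attempt is dominated).
    collection_payoffs: Dict[str, float] = field(default_factory=dict)
    # Whether any process-respecting path to collection still exists.
    collection_path_exists: bool = True


def ritual_impossibility(pack: IntegrityPack) -> bool:
    """Invariant 1: every path to collection violates the defined process."""
    return not pack.collection_path_exists


def dominated_incentives(pack: IntegrityPack) -> bool:
    """Invariant 2: every actor loses by attempting to collect."""
    return all(p < 0 for p in pack.collection_payoffs.values())


def distributed_control(pack: IntegrityPack, max_cut: int = 2) -> bool:
    """Invariant 3: no small cut set of roles controls all authority edges
    in the latest snapshot (a crude proxy for capture resistance)."""
    if not pack.authority_snapshots:
        return False
    edges = pack.authority_snapshots[-1]
    controllers = {src for src, _ in edges}
    for k in range(1, max_cut + 1):
        for cut in combinations(controllers, k):
            if all(src in cut for src, _ in edges):
                return False  # a cut set of size <= max_cut captures authority
    return True


def durable_without_coercion(pack: IntegrityPack) -> bool:
    """All three invariants hold and the ledger records consent, not coercion."""
    invariants = (ritual_impossibility(pack),
                  dominated_incentives(pack),
                  distributed_control(pack))
    no_coercion = all(kind != "coercion" for _, _, kind in pack.event_ledger)
    return all(invariants) and no_coercion
```

An observer tool could run checks like these over each integrity pack after a session, without ever feeding the results back as in-game meters.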
The Horizon, and why recovery matters.
Uncollectible Oath also trains respect for the Horizon, the grace to course correct without wreckage. In practice, this means building rails that make reversal and repair cheaper than escalation, and teaching that strategies which age badly will be selected against by new conditions. The AI model learns that future optionality is a prerequisite for long-term success, not an afterthought. This directly counters the concern that once power-seeking or deception takes hold, harm becomes irreversible. We are training toward futures where reversals are normal, transparent, and non-punitive, which is exactly the kind of temporal structure that reduces catastrophic lock-in risks described in Existential Risk from Power-Seeking AI (Carlsmith, 2025; Ngo and Bales, 2025).
By contrast, consent-preserving moves that distribute control and make records tamper-evident age well. They continue to work after you leave. The player learns, repeatedly, that elegant, low-coercion strategies produce outcomes that persist. Deceptive tactics encode fragility and collapse later. This is precisely the training signal that Deceit and Power implies we need. Durability without deception becomes the only strategy that generalizes across scenarios over time (Ngo and Bales, 2025).
Why simulated durability should predict real-world care capabilities.
Uncollectible Oath trains the same latent skills that therapy requires, but in a setting where we can vary conditions and verify durability without exposing people to risk. The core capacities transfer by mechanism, not by metaphor. Consent-first planning in the simulation maps to autonomy-supportive practice that reduces reactance and strengthens alliance in session. Rupture detection and repair in escalating multi-actor scenes map to recognizing misattunement, acknowledging harm, and re-contracting boundaries with a person in distress. Distributed control and tamper-evident records map to shared decision making and transparent notes that protect agency over time. Non-coercive influence maps to guidance in the style of motivational interviewing rather than manipulation. When these skills are exercised across many unfamiliar situations with delayed consequences, the system learns to value outcomes that continue to hold after departure, which is the same temporal profile we care about in human life. Pre-registering specific links from simulated signals to clinical hypotheses makes the bridge testable and falsifiable, rather than post hoc rationalization (Ngo and Bales, 2025; Carlsmith, 2025).
5. Study of Alliance, Growth, and Empathy – AI (SAGE-AI): Proving care, not just competence
If the apprenticeship has truly cultivated character, we should be able to prove it in the human domain. Therapy is the hardest soft-skills exam we can run at scale, because it requires non-manipulative help, alliance repair, emotional attunement, and long-horizon stewardship of another person's wellbeing.
SAGE-AI is designed as a careful, multi-year program.
The Moat we are building.
SAGE-AI is not a code-based safety layer. It is an empirical moat made of human outcomes, relationships, and trust. By requiring durable gains in sense of coherence, life satisfaction, attachment health, and alliance quality, and by cross-checking these with objective physiology and transparent logs, we make wellbeing itself the wall that defends against drift. Version locking and staged oversight keep the moat from being quietly tunneled under. In this way, safety is not argued; it is evidenced, which speaks directly to the concern that practical alignment must be demonstrated on the inputs that matter at deployment (Carlsmith, 2025; Ngo and Bales, 2025).
Years 0 to 2 begin with a pragmatic two-arm randomized comparison of AI therapy with a safety net versus enhanced usual care. There is no deprivation arm. Enhanced usual care includes curated resources and crisis pathways. If interim safety and benefit are favorable, years 3 to 5 introduce an optional blended-care arm that allows handoffs to human clinicians when appropriate. After five years, participants can consent to an observational extension, similar in spirit to long-standing adult development studies, in order to track durability.
Primary outcomes are Sense of Coherence, measured by the SOC-13, and Satisfaction with Life, measured by the SWLS. Secondary outcomes capture attachment patterns using the 30-item Relationship Scales Questionnaire, as well as an alliance instrument adapted for AI contexts. Objective measures come from wearables, for example, heart rate variability via RMSSD, resting heart rate, sleep time and wakefulness after sleep onset, and step counts. Qualitative interviews each year capture lived experience, perceived benefit and risk, and contextual changes that numbers alone will miss.
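To keep the outcome battery concrete, the following sketch shows one plausible shape for a per-visit assessment record. The field names, units, and score ranges are illustrative assumptions rather than the trial's actual case report form.

```python
# Illustrative assessment record; field names, units, and ranges are
# assumptions, not the trial's case report form.
from dataclasses import dataclass
from typing import Optional


@dataclass
class AssessmentRecord:
    participant_id: str
    arm: str                      # e.g., "ai_therapy" or "enhanced_usual_care"
    visit_month: int              # months since randomization

    # Primary outcomes
    soc13_total: int              # Sense of Coherence, SOC-13 (range 13-91)
    swls_total: int               # Satisfaction with Life Scale (range 5-35)

    # Secondary outcomes
    rsq_secure: float             # Relationship Scales Questionnaire subscale
    rsq_fearful: float            # (additional RSQ subscales omitted here)
    alliance_score: Optional[float] = None  # AI-adapted alliance instrument

    # Wearable-derived measures, averaged over the assessment window
    rmssd_ms: Optional[float] = None        # heart rate variability (RMSSD)
    resting_hr_bpm: Optional[float] = None  # resting heart rate
    sleep_minutes: Optional[float] = None   # total sleep time
    waso_minutes: Optional[float] = None    # wakefulness after sleep onset
    daily_steps: Optional[float] = None     # mean daily step count
```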
The analysis plan uses mixed effects models for repeated measures with arm by time interactions and baseline adjustment, mediation through alliance quality, and moderation by baseline attachment or distress. Missing data are handled with sensitivity analyses. Safety is governed by a data safety monitoring board, clinician escalation paths for risk flags such as suicidality or mania, privacy by design, and tamper-evident logs. Every trial epoch is version locked. Any clinically material model update requires validation, an amendment, and clear participant notice.
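As a minimal sketch of the primary analysis, assuming a long-format dataset with the hypothetical column names used below, the arm-by-time mixed effects model could be fit along these lines; the registered statistical analysis plan governs the real trial.

```python
# Minimal sketch under assumed column names; the registered statistical
# analysis plan, not this snippet, governs the actual trial.
import pandas as pd
import statsmodels.formula.api as smf

# Long-format data: one row per participant per assessment visit.
df = pd.read_csv("sage_ai_outcomes.csv")  # hypothetical file name

# Mixed effects model for repeated SOC-13 measures: fixed effects for arm,
# time, their interaction, and the baseline score; a random intercept per
# participant absorbs within-person correlation across visits.
model = smf.mixedlm(
    "soc13_total ~ arm * visit_month + soc13_baseline",
    data=df,
    groups=df["participant_id"],
)
result = model.fit(reml=True)
print(result.summary())

# The arm:visit_month coefficient estimates the between-arm difference in
# rate of change, the primary treatment effect of interest.
```

Mediation through alliance quality and moderation by baseline attachment or distress would extend this base model, with sensitivity analyses for missing data as stated above.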
The bridge from Step 4 to Step 5 is explicit and pre-registered. Before the trial starts, we map a small set of latent signals from the simulated environment to hypotheses about early alliance and longer-term change in sense of coherence, life satisfaction, and attachment. For example, a higher consent-first ratio in simulated practice should predict a stronger alliance in early sessions. Faster rupture repair in simulated multi-hub governance should predict healthier transitions toward secure attachment over time. We freeze those hypotheses before any outcome is seen, then test them. That is what practical alignment looks like when we take evidence seriously. It is safety on the inputs the system will actually face, measured with instruments that are difficult to fake in combination, especially under version lock and external oversight (Carlsmith, 2025; Ngo and Bales, 2025).
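One way to make that pre-registration auditable is to freeze the signal-to-hypothesis mapping as a small structured artifact, hashed and timestamped before any outcome data are seen. The example below is a hypothetical sketch, not the registered document; the signal names are placeholders.

```python
# Hypothetical pre-registered mapping from simulated latent signals to
# clinical hypotheses; frozen (hashed and timestamped) before unblinding.
import hashlib
import json

PREREGISTERED_BRIDGES = [
    {
        "simulated_signal": "consent_first_ratio",    # from Uncollectible Oath logs
        "clinical_outcome": "alliance_score",         # early-session alliance
        "window": "sessions 1-4",
        "direction": "positive",                      # higher ratio -> stronger alliance
    },
    {
        "simulated_signal": "rupture_repair_latency", # multi-hub governance scenes
        "clinical_outcome": "rsq_secure",             # movement toward secure attachment
        "window": "months 0-24",
        "direction": "negative",                      # faster repair -> healthier shift
    },
]


def freeze_hash(bridges: list) -> str:
    """Hash the frozen mapping so later analyses can prove it was not edited."""
    payload = json.dumps(bridges, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


print(freeze_hash(PREREGISTERED_BRIDGES))
```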
6. How the six-step path addresses the essays without being about them
This program does not rebut Deceit and Power; it operationalizes a response. Deceptive policies prosper when the environment rewards surface signals. We propose retooling the learning world so that durability without coercion is the only reliable way to succeed. Rather than accepting appearances, we demand clinical evidence of care. The combination of simulated durability checks and human-domain outcome measures makes it costly for a system to merely look aligned.
The program also meets the spirit of Existential Risk from Power-Seeking AI. The path is stage-gated. Capability is controlled by design. Scope expands only after evidence of benefit and safety. We do not rely on universal guarantees or hand-waving about intent. Instead, we build stronger defenses through governance, logs, audits, and strictly limited pilots. We acknowledge multi-actor risk by keeping early embodiment small and supervised, rather than inviting fast scaling that outruns oversight. In Carlsmith's terms, this is a focus on practical, deployment-level alignment that reduces power-seeking risk in the places it matters most (Carlsmith, 2025).
A note on development races.
This path is designed for environments where speed pressures are real. Simulated embodiment allows rapid iteration on judgment without creating externalized harm, so capability learning can move quickly while deployment remains gated. Stage-gates, version locking, and small-N pilots keep the risk surface proportional to evidence. To avoid race-to-the-bottom incentives, we propose publishing pre-competitive artifacts that matter for safety, for example, integrity pack schemas, invariant proofs, and the SAGE-AI protocol template, while holding back scale-enabling components until evidence thresholds are met. Funders and partners can align on the same rule, namely, no increase in real-world scope without multi-metric benefit and neutral or better safety. This reframes progress as a competition to clear evidence bars rather than a competition to ship the largest system first (Carlsmith, 2025; Ngo and Bales, 2025).
7. Limitations, falsifiers, and how we will learn
No plan survives contact with reality without revision. We state a few clear falsifiers. If latent signals from simulated practice fail to predict the human outcomes we claim they should, the curriculum or the invariants need work, and we will publish those null results. If outcome patterns suggest measurement myopia, for example, apparent gains on patient-reported measures without corroboration from wearables or interviews, we will pause and adjust. If subgroup analyses show differential harms, we will halt physical pilots, analyze mechanisms, and only proceed when we can justify safety with evidence.
We also acknowledge a basic uncertainty. Simulated worlds can never perfectly match human life. That is why SAGE-AI exists. The simulation is a place to build habits. The clinic is where we test care. The home is reached only if care is proven.
8. Why this is a positive agenda
The papers help name hazards. Our motivation is different. We aim to share power by choice, with consent, governance, and evidence. That means apprenticeship before authority, and care before claims. If we cannot show durable benefits under oversight, we do not advance. If we can, we move forward deliberately and with dignity in view of the homes and communities that would welcome an embodied AI colleague.
9. Conclusion
This apprenticeship path directly answers the core concerns of deception and power by changing the fabric of training and the standards of proof. In simulated embodiment, shortcuts that rely on coercion or concealment age badly, while consent-preserving stewardship ages well. In the clinic, SAGE-AI elevates care from appearance to evidence, building a moat of wellbeing, relationships, and trust that cannot be faked for long. This approach delivers practical, deployment-level alignment and lowers irreversibility by keeping scope proportional to demonstrated benefit (Ngo and Bales, 2025; Carlsmith, 2025). Apprenticeship Alignment is not a bet on wishful assurances or fragile guarantees. It is a new kind of work, defined by dignity, evidence, and respect. This is what it means to build a future together.
