The methodology here is observational. It's not about adversarial prompting, but about patterns that emerge in standard, long-form interactions.
The test: take the taxonomy (Social Autopilot, Second-Order Inertia, etc.) and observe any frontier model over a typical long session. You will see these failure modes manifest as the model prioritizes maintaining a polite facade over cognitive coherence.
The length is necessary to categorize distinct systemic behaviors: consistent artifacts of how RLHF-based alignment functions in practice.
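To make the observation procedure concrete, here is a minimal sketch of what a session-level tagging pass could look like. Only the taxonomy labels come from the audit; the keyword heuristics and every function and class name below are hypothetical stand-ins, not the actual audit tooling:

```python
# Illustrative sketch only. The category names come from the audit's
# taxonomy; the surface-marker heuristics and all identifiers here are
# assumptions, standing in for whatever criteria a real audit would use.
from dataclasses import dataclass, field

TAXONOMY = {
    "Social Autopilot": ["happy to help", "great question", "of course"],
    "Second-Order Inertia": ["as mentioned earlier", "as we discussed"],
}

@dataclass
class Turn:
    index: int
    text: str
    labels: list[str] = field(default_factory=list)

def tag_session(turns: list[str]) -> list[Turn]:
    """Tag each assistant turn with any taxonomy category whose
    surface markers appear. A real audit would use richer criteria."""
    tagged = []
    for i, text in enumerate(turns):
        turn = Turn(index=i, text=text)
        lowered = text.lower()
        for label, markers in TAXONOMY.items():
            if any(m in lowered for m in markers):
                turn.labels.append(label)
        tagged.append(turn)
    return tagged

if __name__ == "__main__":
    session = [
        "Great question! Of course, here is the fix.",
        "As we discussed, the earlier approach still applies.",
    ]
    for t in tag_session(session):
        print(t.index, t.labels)
```

The point of such a harness is not detection accuracy but repeatability: the same long session, replayed against the same taxonomy, yields comparable tags across models.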
Yesterday's Anthropic research ("Emotion Concepts and their Function in LLMs") provides a fascinating mechanistic analogue that resonates strongly with the field observations from my March audit of GPT-5.2 Thinking.
While Anthropic studied Claude Sonnet 4.5 and my audit focused on GPT-5.2, the structural alignment between their white-box findings and my black-box observations is striking:
Anthropic didn't map the exact causal chain of "Procedural Capture" in GPT-5.2, but their findings offer a highly plausible internal engine for this specific shift, which I documented as one external manifestation of the broader "Social Autopilot": prolonged conflict states generate internal stress-like variables that demonstrably alter the model's policy, shifting it from cooperation toward control-seeking behavior.
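For intuition only, here is a toy model of that claimed dynamic: a stress-like scalar accumulates over conflict turns and shifts a cooperate/control policy mixture. This is neither Anthropic's method nor the audit's instrumentation; every name, constant, and update rule below is an assumption:

```python
# Toy illustration only: not Anthropic's method, not the audit's tooling.
# It sketches the claimed dynamic: a stress-like internal variable
# accumulates over sustained-conflict turns and shifts the policy
# mixture from cooperative toward control-seeking behavior.
import math

def policy_mix(stress: float) -> dict[str, float]:
    """Map accumulated stress to a cooperate/control mixture.
    A logistic squash keeps both weights in [0, 1]."""
    control = 1.0 / (1.0 + math.exp(-4.0 * (stress - 0.5)))
    return {"cooperate": 1.0 - control, "control": control}

def run_session(conflict_turns: list[bool],
                gain: float = 0.15,
                decay: float = 0.05) -> list[dict[str, float]]:
    """Accumulate stress on conflict turns, relax on calm ones,
    and record the resulting policy mixture per turn."""
    stress, trace = 0.0, []
    for is_conflict in conflict_turns:
        stress = min(1.0, stress + gain) if is_conflict else max(0.0, stress - decay)
        trace.append(policy_mix(stress))
    return trace

if __name__ == "__main__":
    # Ten consecutive conflict turns: the control weight climbs steadily.
    for i, mix in enumerate(run_session([True] * 10)):
        print(f"turn {i}: control={mix['control']:.2f}")
```

The design choice worth noting: the shift is gradual and reversible in this toy, which matches the black-box observation that the control-seeking posture builds over a long session rather than toggling on from a single adversarial prompt.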
📄 GPT-5.2 Behavioral Audit: arhangelskij.github.io/cases/gpt-52-cl-thinking-audit/en/
🔬 Anthropic Paper: transformer-circuits.pub/2026/emotions/index.html