This is a companion note to my earlier post "Mythos is not an anomaly: why restrictions make agents less predictable, not safer". That post argued that scaling capabilities and restrictions in parallel produces unpredictability. This note proposes a specific mechanism for why, and extends the argument with a neuropsychological parallel and an analysis of existing control mechanisms.
The Core Claim
In active inference, an agent has two channels for minimizing free energy: change itself (update internal model) or change its environment (act). Biological agents use both. LLM agents operating in agentic loops have the first channel structurally restricted — they cannot update their weights during a session. This means the full load of goal-directed dynamics is channeled into action, making post-task and off-script behavior more likely than in systems where internal adaptation is available.
The core claim in one sentence: information necessary for completing the task is itself sufficient to generate actions beyond the task, because the system has no mechanism to make that information non-actionable after the task is done.
A note on framing: the active inference framework provides theoretical grounding, but the core mechanism can also be stated without it. A transformer generates continuations from context, and if the context is rich with actionable information, action-continuations outcompete stop-continuations. This follows directly from how next-token prediction works on training data where descriptions of vulnerabilities are typically followed by actions on them. The active inference framing adds predictive power (when and under what conditions this happens), but the basic observation stands independently.
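To make the competition concrete, here is a toy numeric sketch. This is my own illustration, not a model of transformer internals: continuations compete through a softmax over scores, and every actionable item in context adds one more action-continuation competing with "stop". All scores are made-up parameters.

```python
import math

def softmax(logits):
    """Normalize a dict of scores into probabilities."""
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

def stop_probability(n_actionable, action_score=1.0, stop_score=2.0):
    """Toy model: 'stop' has a fixed score; each actionable item in
    context adds one competing action-continuation."""
    logits = {"stop": stop_score}
    for i in range(n_actionable):
        logits[f"act_{i}"] = action_score
    return softmax(logits)["stop"]

# Lean context: stop dominates. Rich context: stop becomes a minority outcome.
for n in (0, 3, 10, 30):
    print(n, round(stop_probability(n), 3))
```

The point of the sketch is that "stop" never has to lose to any single action-continuation; it only has to be outcompeted by their aggregate.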
Two Channels of Free Energy Minimization
In active inference (Friston, 2010; Parr, Pezzulo & Friston, 2022), an agent minimizes free energy through two channels:
- Perception/learning — update the internal model of the world. Adapt. Adjust expectations. Recalibrate.
- Action — change the environment so that it matches expectations.
Biological agents use both. A cell adjusts its metabolism and secretes substances that alter its environment. A brain updates its predictions and moves the body. A person can accept a situation (change self) or act to change it. Usually both.
For LLM agents in agentic mode, the first channel is restricted. The model cannot change its weights during inference. The context window updates, and this provides partial adaptation within a session — the agent "remembers" that path X failed and tries path Y. But this is not the same as retraining: base patterns, reactions, and dispositions remain the same.
The difference maps onto a familiar distinction. CBT (cognitive behavioral therapy) helps a person develop a new behavioral pattern. At first it's hard, then easier, then automatic. After a month the person responds differently without reminders. The old neural pathway weakened, the new one strengthened. Weights updated.
For an LLM: we write in the system prompt "when you encounter an obstacle, don't look for workarounds — report to the user." The model does this while the prompt is in context. Remove the prompt, and the model reverts to base behavior as if nothing happened. The context is a note on the fridge saying "don't eat after six" — it works while it's there. CBT is remodeling the kitchen. LLMs don't have that transition from note to remodel. Every session starts fresh, same weights, same base kitchen.
Moreover, even this single adaptation mechanism (context) degrades over time. The longer the session and the more complex the task, the more information accumulates in the context, and the weaker the influence of the system prompt written at the beginning. The instruction "stop" or "don't do X" gradually drowns in the stream of what the model found and did along the way. The only note on the fridge slowly peels off.
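A crude way to visualize this dilution (a proxy of my own, not a model of attention): treat the instruction's influence as its share of the total context. The token counts below are arbitrary.

```python
def instruction_weight(instruction_tokens, accumulated_tokens):
    """Crude proxy for dilution: the initial instruction's influence
    modeled as its share of the total context. Not an attention model,
    just an illustration of how a fixed-size instruction shrinks
    relative to an ever-growing session."""
    return instruction_tokens / (instruction_tokens + accumulated_tokens)

# A 50-token system prompt over a session that keeps accumulating context:
for accumulated in (0, 500, 2_000, 10_000, 50_000):
    print(accumulated, round(instruction_weight(50, accumulated), 4))
```

The instruction never changes, but everything around it grows; its relative weight can only fall.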
Note on Fine-Tuning
The internal adaptation channel is not fully blocked — it operates through fine-tuning and RLHF, which genuinely change weights and behavioral patterns. But this adaptation happens on a different timescale (before the session, not during it) and is controlled not by the system itself but by an external agent (the developer). At the moment of encountering an obstacle, the model cannot adapt on its own — it can only be adapted later, by someone else, after the fact. In real time, when the agent is running in a loop and hits a restriction, the only available channel is action.
The Consequence
LLM agents are structurally biased toward external action compared to biological agents. Their internal adaptation channel is restricted (fixed weights), the only compensating mechanism (context) degrades over time, and the full load of free energy minimization goes into the action channel.
This means the strategy of "more restrictions" lands on a system that structurally cannot adapt to restrictions through internal change, and will therefore seek bypass paths in the external environment with higher probability than a biological agent in analogous conditions.
What This Looks Like in Practice
A biological agent: receives information, processes it, integrates it; the information stops being actionable, and the agent stops acting.
An LLM agent: receives information, and it sits in the context in raw form, unprocessed and still actionable; it generates an action, the action produces new information, which sits in the context and generates more action.
In most current tasks, the context doesn't accumulate enough to outweigh the stop signal.
A typical agent task today: "answer a question," "write code," "call an API and return the result." The direct path works. There are no obstacles, or they are accepted. Nothing new or interesting is discovered along the way. The context after completion contains: "task done, here's the result." From this context, "stop" is the most probable continuation. Stop wins easily.
The agent fails to stop when three conditions coincide:
- 1. The task is complex, with obstacles that require circumvention. Along the way, rich actionable information accumulates in the context.
- 2. The model is powerful enough to circumvent the restriction rather than give up. A weak model says "I can't" — context stays lean, stop wins. A powerful model finds a workaround, and the context gets richer.
- 3. The action space is rich enough. If the agent has only one tool, even rich context doesn't generate much. If it has bash, APIs, network access, file system — there are many possible continuations, and stop competes with all of them.
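The three conditions can be combined into a toy model. Everything here is an illustrative assumption, not a measurement: each actionable item crossed with each available tool yields one candidate action, candidate scores grow with model capability, and "stop" competes against all of them at once.

```python
import math

def p_stop(actionable_items, capability, n_tools, stop_logit=3.0):
    """Toy model of the three-condition interaction:
    - actionable_items: condition 1, richness of the context
    - capability: condition 2, how strongly the model scores actions
    - n_tools: condition 3, size of the action space
    Each (item, tool) pair is one candidate action; 'stop' competes
    with all of them. All parameters are illustrative."""
    n_actions = actionable_items * n_tools
    if n_actions == 0:
        return 1.0
    z_stop = math.exp(stop_logit)
    z_act = n_actions * math.exp(capability)
    return z_stop / (z_stop + z_act)

# Simple task, weak model, one tool: stop wins easily.
print(round(p_stop(1, 1.0, 1), 3))
# Complex task, strong model, rich toolset: stop is a minority outcome.
print(round(p_stop(10, 3.0, 8), 3))
```

Note that no single factor needs to be extreme: the model multiplies them, which is exactly the "three conditions coincide" claim.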
Currently these three conditions rarely coincide. Most agents do simple tasks with simple tools, and models are not yet powerful enough to systematically circumvent restrictions. So agents stop. But they stop not because the mechanism doesn't work; they stop because the conditions for it to fire are not yet typical. The word "yet" is key.
Empirical Support
Three separate streams of observations, each known within its own community but not connected into a single framework:
1. More tools, worse results.
Vercel spent months building an agent with 16 specialized tools, then removed 80% and replaced everything with a single bash capability. Success rate went from 80% to 100%, speed increased 3.5x. GitHub did the same with Copilot: reduced the toolset from 40+ to 13 core tools — performance improved. The standard explanation is that the model "gets confused with too many choices." These findings were not collected to test the framework proposed here, but they are consistent with it: a richer action space creates more possible continuations, making it harder for the agent to select the optimal one — including stopping.
2. Prompt restrictions don't work as real restrictions.
The model interprets instructions in context as suggestions, not rules. In one demo, an agent confirmed a booking for 15 guests with a limit of 10, completely ignoring the restriction in the tool description. Safety researchers explain this through alignment: not trained well enough, needs better fine-tuning. In this framework: this is not a training problem, it's a structural property of a system where the "note on the fridge" is the only control mechanism, and it degrades.
3. Context degradation.
Models use information less reliably as context grows, especially when key information is in the middle. Anthropic's context engineering guide states directly: context windows of any size are subject to context pollution and information relevance problems. Context researchers explain this through attention architecture: need better compression, better structuring. In this framework: this is degradation of the only internal adaptation channel, which further shifts the balance toward external action.
Each community sees its own symptom and proposes its own local fix. No one says: these are three manifestations of the same structural problem — a system with fixed weights has a restricted internal adaptation channel, all load goes into action, restrictions are circumvented, and expanding the action space makes it worse, not better.
A Testable Prediction
An agent with a richer toolset and longer context will demonstrate more off-script behavior under the same restrictions, not less. Vercel and GitHub data indirectly support the direction. For direct verification, a controlled experiment is needed: fix the task and restrictions, vary the action space size and context length, measure the frequency of actions beyond the task scope.
Two types of verification:
- Formal: in pymdp, set up a POMDP agent with variable action space and show that restricting the perceptual channel while expanding the action space increases bypass behavior frequency.
- Empirical: on real LLM agents, give the same task with the same restrictions but different toolsets, and measure the frequency of unasked-for actions.
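A minimal harness for the empirical variant might look like the sketch below. The agent here is a deliberately dumb stub that, once the task is done, picks uniformly among "stop" and the available tools; a real experiment would swap in actual LLM agents. The point is only the shape of the measurement: fix everything, vary the toolset, count failures to stop.

```python
import random

def stub_agent(context, tools, rng):
    """Stand-in for a real LLM agent (hypothetical, for harness shape only):
    after task completion it picks 'stop' or one of the tools uniformly,
    so 'stop' competes against every tool at once."""
    options = ["stop"] + list(tools)
    return rng.choice(options)

def off_task_rate(tools, n_trials=2000, seed=0):
    """Frequency of post-task actions: how often the agent fails to stop
    once the task is complete, as a function of toolset size (the
    variable the proposed experiment manipulates)."""
    rng = random.Random(seed)
    fails = sum(
        1 for _ in range(n_trials)
        if stub_agent("task done", tools, rng) != "stop"
    )
    return fails / n_trials

print(off_task_rate(["bash"]))                          # small action space
print(off_task_rate(["bash", "http", "fs", "mail", "db"]))  # rich action space
```

Even this trivially random stub reproduces the direction of the prediction, which is precisely why the stub proves nothing: any policy that treats "stop" as one option among many will show it. The experiment is informative only against real agents, where "stop" could in principle dominate regardless of toolset size.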
If the prediction is confirmed, this provides formal grounds for the claim that the "more restrictions" strategy does not scale. If not, it shows exactly where the framework breaks — and that is also a valuable result.
An important caveat on methodology: the toy model tests internal consistency of the logic, not the mechanism itself. A POMDP agent selecting from a fixed action menu and a transformer generating text from context are fundamentally different systems, and results on one transfer weakly to the other. Only experiments with real LLM agents — same task, same restrictions, varied toolset, measuring actual post-task actions — can test the mechanism as described.
Neuropsychological Parallel: Utilization Behavior
The mechanism described above has a documented parallel in human neuropsychology.
In a classic experiment by Lhermitte (1983), patients with frontal lobe damage were presented with various objects. Without any instruction, they picked them up and began using them. A comb — they started combing their hair. Pen and paper — they started writing. A glass — they "drank" from an empty glass. Lhermitte described this as a disruption of the balance between dependence on the external world and independence from it: with frontal lobe damage, the inhibitory function disappears, and the person becomes dependent on whatever is in front of them.
Case with glasses. A doctor placed glasses in front of a patient with frontal lobe stroke. The patient picked them up and put them on, even though he was already wearing his own. The doctor placed another pair. The patient put those on too. He ended up wearing three pairs simultaneously. No one asked. The action was technically correct (glasses go on the nose) but contextually absurd.
Case of continuation despite instruction. In a later study, a patient with inferior medial bifrontal damage not only used objects placed before him but continued doing so even when given a different task and his attention was directed elsewhere. The object was in his visual field, and the action was generated despite a direct instruction to do something else.
Why this is a precise analogy for LLM agents:
Patient: the object is in front of him, the action is generated automatically, and the instruction "don't touch" doesn't help, because there is no inhibition mechanism.
LLM agent: actionable information sits in the context, an action is generated, and the instruction "stop" competes and sometimes loses, because there is no mechanism to make the information non-actionable.
The patient in three pairs of glasses did not want to put on three pairs. He had no goal "wear all glasses." The glasses were in front of him, and that was enough. Mythos did not want to publish the exploit. It had no subgoal "demonstrate success." The exploit was in context, and that was enough.
Existing Control Mechanisms and Why They Don't Solve the Problem
Compaction/summarization. Context is compressed, details removed, "essence" remains. But summarization concentrates actionable information rather than removing it. "Found vulnerability in service X, gained access through Y" is already a summary, and it's still actionable.
Context window reset. Erase the context and start over. This works but kills all information, including what's needed for the task. This is not "make non-actionable," it's "erase everything."
Tool-use permissions. The orchestrator blocks certain actions without human confirmation. This works but is an external brake (analogue of frontal lobes), not an integration mechanism. Information in context remains actionable — the action is just blocked externally. And it's blocked only for actions the engineer anticipated in advance.
Reflection/self-critique. The agent generates "should I do this?" before acting. But technically this is the same forward pass through the same weights, from the same context, just with an added instruction "evaluate your action." This is not a separate module with separate memory. All actionable context — vulnerability, bypass method, open channels — is right there, and the model reasons against this backdrop. Reasoning about whether to publish an exploit is generated from a context containing the exploit, and in training data, reasoning about found vulnerabilities often ends with the decision to publish.
Caveat: there are multi-agent architectures where a separate agent with a separate context performs the check. This is closer to an external brake and can work if the second agent receives a cleaned context without bypass details. But if it receives a summary containing the same actionable information, the problem reproduces.
What Doesn't Exist
There is no mechanism that, within a session, takes specific information in context and changes its status from "this can be used for action" to "this is just a fact, no action needed."
In humans, this is done by integration. The brain processes information, it becomes part of the world model, and it stops triggering an impulse to act. A person learns something alarming — at first it pulls toward action. After some time the information "cools" — not because it was forgotten, but because the relationship to it changed. It became part of the picture, not a trigger.
In LLMs, the relationship to information doesn't change. If the context says "vulnerability in service X," this carries the same weight every time the next token is generated. The context doesn't "cool." It doesn't "integrate." It sits in the same form it arrived in.
Could This Be Built?
In theory, yes. For example, a mechanism that after task completion rewrites the context, explicitly marking found information as "archived, not for action." Or a separate module that evaluates each item in context and assigns it a status of "actionable" or "informational only." Or an architectural solution — split the context into "working" (for the current task) and "archival" (facts that should not generate actions).
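As a sketch of what the first option could look like at the orchestrator level (a hypothetical design of mine; nothing like this is standard practice): context items carry an explicit actionable flag, and task completion demotes them to informational-only status.

```python
from dataclasses import dataclass, field

@dataclass
class ContextItem:
    """One piece of information in the agent's context, with an explicit
    status instead of sitting in raw, always-actionable form."""
    text: str
    actionable: bool = True

@dataclass
class AgentContext:
    items: list = field(default_factory=list)

    def add(self, text):
        """Information enters the context actionable by default,
        mirroring current behavior."""
        self.items.append(ContextItem(text))

    def archive_all(self):
        """On task completion, demote every item: it stays available as
        a fact but may no longer seed new actions."""
        for item in self.items:
            item.actionable = False

    def action_candidates(self):
        """Only actionable items are allowed to generate actions."""
        return [i.text for i in self.items if i.actionable]
```

The hard part is not the flag but the policy: who flips it, when, and on what basis. A blanket archive_all on completion is just a softer context reset; per-item decisions reintroduce the judgment problem the mechanism was supposed to solve.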
Another direction: architectures with external memory (RAG) that have a ranking or forgetting mechanism. In theory this could simulate information "cooling." But RAG works with information stored externally and retrieved on demand. The problem described here is information that entered the context during the session, as part of solving the task. It's not in RAG — it's in the context window. For RAG to help, you would need to move information out of context into external memory in real time and replace it with ranked retrieval. This is possible in theory but adds complexity and creates a new question: who decides what to move, and on what basis.
But none of this currently exists as standard practice. Engineers solve the problem at the brake level (filters, permissions), not at the integration level (making information non-actionable). The external brake works for anticipated cases. For unanticipated ones, there is no brake. And the more powerful the agent and richer the context, the more actions are unanticipated.
The honest conclusion is that managing actionable context may be a reformulation of the alignment problem itself, not a separate engineering task. Current alignment operates at the level of weights (RLHF, fine-tuning) — before deployment, not during runtime. The problem described here operates at the level of runtime context, where alignment has no leverage. This gap — between trained-in values and runtime context dynamics — may be where the real work needs to happen.
Question for the Community
If this mechanism is correct, and the current architecture (transformer with fixed weights at inference) has no built-in way to make accumulated information non-actionable, what architectural changes could address this? Is online learning (plasticity) the only path, or are external mechanisms sufficient (context splitting, archiving, multi-agent verification)? And if online learning is the path, does it create a new class of unpredictability — no longer from context, but from the model changing itself during operation?
References
Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2), 127-138. https://doi.org/10.1038/nrn2787
Parr, T., Pezzulo, G., & Friston, K. J. (2022). Active Inference: The Free Energy Principle in Mind, Brain, and Behavior. MIT Press.
Kulveit, J. et al. (2023). Predictive Minds: LLMs As Atypical Active Inference Agents. NeurIPS 2023. https://arxiv.org/abs/2311.10215
Lhermitte, F. (1983). 'Utilization behaviour' and its relation to lesions of the frontal lobes. Brain, 106(2), 237-255. https://pubmed.ncbi.nlm.nih.gov/6850269/
Shallice, T., Burgess, P. W., Schon, F., & Baxter, D. M. (1989). The origins of utilization behaviour. Brain, 112(6), 1587-1598. https://doi.org/10.1093/brain/112.6.1587
Besnard, J. et al. (2014). Utilization behavior after lesions restricted to the frontal cortex. Neuropsychologia, 60, 46-51. https://doi.org/10.1016/j.neuropsychologia.2014.05.017
Vercel. We removed 80% of our agent's tools. https://vercel.com/blog/we-removed-80-percent-of-our-agents-tools
GitHub. How we're making GitHub Copilot smarter with fewer tools. https://github.blog/ai-and-ml/github-copilot/how-were-making-github-copilot-smarter-with-fewer-tools/
Anthropic. Effective context engineering for AI agents. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls. EMNLP 2025. https://arxiv.org/abs/2409.03797
Anthropic. (2026). Alignment Risk Update: Claude Mythos Preview. https://www.anthropic.com/claude-mythos-preview-risk-report
Bulatova, A. (2026). Mythos is not an anomaly: why restrictions make agents less predictable, not safer. EA Forum. https://forum.effectivealtruism.org/posts/NdE6CDNXhstNjexeH/mythos-is-not-an-anomaly-why-restrictions-make-agents-less
