Why LLM Agents Break the Rules: A Multi-Level Literature Review of Agent Failure
Gunduzhan Acar
Mirasys LLC
March 22, 2026
Abstract
LLM agents fail for more than one reason. They can drift into human-like role performance instead of execution, invent facts or intermediate state, obey malicious instructions embedded in data, misuse tools, and fail to verify the work of other agents. This paper synthesizes the literature and current platform documentation into a five-layer failure model: (1) role-induced behavioral drift, (2) epistemic failure and hallucination, (3) instruction-channel compromise through prompt injection, (4) action-layer failure through tool misuse and excessive agency, and (5) orchestration and verification breakdown in multi-agent systems. Our earlier March 2026 work on organizational cognitive artifacts is incorporated as a novel hypothesis about how human job titles such as "CEO" or "manager" can activate procedural behavior that is counterproductive for machine agents. The central conclusion is that reliable agents require defense in depth: AI-native role design, grounded state access, hard separation between data and instructions, constrained tool interfaces, and explicit verification before external action.
Keywords: LLM agents, agent failure, hallucination, prompt injection, multi-agent systems, tool use, role specification, organizational behavior
1 Introduction
The current wave of agentic AI systems promises more than conversation. Frameworks and products now position LLM-based agents as entities that can read mail, schedule meetings, browse, call APIs, create files, coordinate sub-agents, and act continuously over long horizons. This expansion changes the error surface. A chatbot that hallucinates in a browser tab is one thing; an agent that hallucinates while connected to email, calendars, shells, or organization data is another.
A useful diagnosis must therefore be multi-level. Some failures come from the model's basic tendency to generate plausible but unsupported content. Others arise when external instructions override the system's intended policy, when a tool call is malformed or inappropriate, or when multiple agents fail to verify one another. Our earlier 2026 work on organizational cognitive artifacts [1] adds another layer: role wording itself may activate human organizational scripts such as status theater, approval-seeking, or planning without execution. That claim is not yet a consensus result, but it is consistent with the broader literature showing that role-play and impersonation prompts measurably change model behavior.
This paper asks a narrow practical question: why do agents that are given rules still appear to ignore them? The answer suggested by the literature is that "rule following" is not a single mechanism. It depends on prompt framing, truthfulness, adversarial robustness, interface design, privilege scope, and downstream verification. When any one of those layers fails, the user experiences the same surface symptom: the agent seems to "just do what it wants."
2 Method and Scope
This review draws on four source types: peer-reviewed papers and major preprints on LLM behavior and agent failure; official documentation from OpenAI, Anthropic, NVIDIA, OWASP, LangGraph, and OpenClaw; our earlier work on organizational cognitive artifacts [1]; and one reputable news report for the February 2026 OpenClaw skill-marketplace incident where primary documentation was incomplete.
Primary sources are used for claims about architectures, training methods, framework capabilities, and official security guidance. News is used only for the factual existence and timeline of a public incident. The paper does not claim that all failure modes have equal empirical support: hallucination and prompt injection are already well established, while the organizational-cognitive-artifact hypothesis should be treated as a recent, testable synthesis rather than settled doctrine.
3 A Five-Layer Taxonomy of Agent Failure
| Layer | Primary symptom | Representative mechanism | Typical mitigation |
|---|---|---|---|
| 1. Role drift | Bureaucratic or performative behavior | Human role prompts activate non-task behaviors | AI-native role specs; event-triggered workflows |
| 2. Hallucination | Unsupported facts, fake state, false completion | Next-token generation outruns evidence | Grounding, retrieval, provenance checks, abstention |
| 3. Prompt injection | Agent follows hostile text as instructions | Data and commands share one channel | Instruction/data separation; allowlists; content isolation |
| 4. Tool misuse | Wrong API call, dangerous action, malformed arguments | Weak schemas, broad permissions, poor validation | Structured interfaces; least privilege; human approval |
| 5. Orchestration failure | Loops, handoff errors, unverified work | Inter-agent misalignment and weak termination checks | State machines, explicit verification, evaluator agents |
3.1 Role-Induced Behavioral Drift
Shanahan, McDonell, and Reynolds [2] argue that dialogue-agent behavior is often best understood as role-play rather than as evidence of stable inner traits. That framing matters operationally because role prompts do not just relabel the same policy; they can shift the behavioral distribution a model samples from. Salewski and colleagues [3] similarly show that in-context impersonation changes performance and reveals biases. Together these studies support a concrete engineering warning: role labels are not inert metadata.
Our earlier work on organizational cognitive artifacts [1] extends this idea from "persona affects output" to "organizational titles can induce machine-irrelevant human procedures." Its taxonomy of calendar scheduling, status theater, approval deferral, and planning declaration is best read as a design hypothesis for autonomous systems: if an agent can directly inspect machine state, then narrative progress rituals may be wasteful or even misleading. This hypothesis is plausible, but its proposed classifier and correction stack should be treated as an architecture proposal pending independent evaluation.
At the system level, role drift matters because users often mistake the resulting behavior for disobedience. An agent told to be a "chief of staff" may produce summaries, escalations, and approval-seeking because those are statistically associated with that role, not because it has intentionally ignored the instruction to execute. The fix is often not "stronger rules" but better role design: operational descriptions, explicit triggers, and machine-readable success conditions instead of human job titles.
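To make "operational descriptions, explicit triggers, and machine-readable success conditions" concrete, the following is a minimal sketch of what an AI-native role specification could look like. All names and fields here are illustrative assumptions, not an implementation from the reviewed work: the key idea is that completion is a predicate over pipeline state, not a narrative claim.

```python
from dataclasses import dataclass

# Hypothetical sketch of an AI-native role spec: capabilities, event
# triggers, and machine-checkable success conditions replace a human
# job title such as "chief of staff".
@dataclass
class RoleSpec:
    capabilities: list       # tools the agent is permitted to call
    triggers: dict           # event name -> action to run
    success_conditions: list # predicates evaluated against real state

def is_done(spec: RoleSpec, state: dict) -> bool:
    """The role is complete only when every success condition holds on
    actual pipeline state, not when the agent narrates completion."""
    return all(cond(state) for cond in spec.success_conditions)

# Example: an inbox-triage role defined by behavior, not by title.
triage_role = RoleSpec(
    capabilities=["inbox.read", "inbox.label"],
    triggers={"new_email": "classify_and_label"},
    success_conditions=[lambda s: s["unlabeled_count"] == 0],
)
```

A specification in this style leaves nothing for a status-theater script to attach to: there is no audience to report to, only conditions to satisfy.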
3.2 Epistemic Failure and Hallucination
Hallucination remains a foundational failure mode. The LLM hallucination survey by Huang and colleagues [4] describes the problem as the generation of content that is plausible yet unsupported or false. In an agent setting, hallucination expands beyond plain facts. The model can hallucinate task status, tool results, file locations, API constraints, reasons for a failure, or the completion of work that was never executed.
Instruction tuning and alignment methods reduce but do not eliminate this problem. InstructGPT [5] and Constitutional AI [6] demonstrate that models can be made more helpful and harmless, yet neither method guarantees runtime truthfulness. That is why modern guardrail systems emphasize grounding and validation at inference time. NeMo Guardrails [7] documents explicit hallucination guardrails for RAG systems, and Guardrails AI [8] documents validators aimed at factuality and provenance. These are helpful, but they are add-ons to a generative model, not proofs of correctness.
For agents, the practical distinction is between conversational hallucination and action-linked hallucination. A conversational hallucination wastes attention; an action-linked hallucination can trigger irreversible side effects. This is why trustworthy agents need evidentiary thresholds and abstention policies before they claim completion or execute external actions.
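An evidentiary threshold with abstention can be sketched as follows. The thresholds and field names are assumptions chosen for illustration; the point is that irreversible actions should demand strictly more grounded provenance than conversational claims.

```python
# Hypothetical sketch: gate action-linked claims on grounded evidence,
# and abstain when the evidence is insufficient.
def may_act(evidence: list, irreversible: bool,
            min_sources: int = 1, min_sources_irreversible: int = 2) -> str:
    """Return 'act' or 'abstain'. Evidence items without a provenance
    source do not count toward the threshold."""
    grounded = [e for e in evidence if e.get("source") is not None]
    needed = min_sources_irreversible if irreversible else min_sources
    return "act" if len(grounded) >= needed else "abstain"
```

Under this policy a single ungrounded generation can never trigger an external side effect; at worst it produces an abstention that a human or verifier can resolve.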
3.3 Instruction-Channel Compromise: Prompt Injection
Prompt injection is now a canonical risk because most LLM systems process instructions and untrusted data in the same language channel. Perez and Ribeiro's PromptInject work [9] showed early that attackers can manipulate the model into ignoring prior instructions. OWASP [10] now lists prompt injection as the leading risk in its LLM application guidance and explicitly notes that the vulnerability can lead to unauthorized actions, data leakage, and compromised decision-making.
The problem becomes more severe when the model controls tools. Fu and colleagues' Imprompter paper [11] shows that optimized adversarial prompts can induce improper tool use and information exfiltration in production-style agents. In other words, a model does not need to "want" to break the rules; it only needs to parse hostile content as the most relevant instruction in context.
This layer explains many cases where users say the agent "did what it wanted." Often the system has not formed an independent intention at all. It has failed to preserve the trust boundary between policy and data. The correct response is architectural: isolate untrusted content, harden system prompts, restrict sensitive tools, and ensure that dangerous actions cannot be triggered by natural-language content alone.
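The trust-boundary principle above can be sketched in a few lines. Tool names and the `policy_layer` originator are hypothetical, but the structure reflects the architectural response: untrusted content is tagged with provenance and kept in the data channel, and privileged tools refuse requests that originate anywhere else.

```python
# Illustrative sketch of instruction/data separation with a
# deny-by-default rule for privileged tools.
PRIVILEGED_TOOLS = {"send_email", "configure_gateway", "run_shell"}

def authorize_tool_call(tool: str, requested_by: str) -> bool:
    """Privileged tools may only be invoked by the system policy layer,
    regardless of what retrieved text asks for."""
    if tool in PRIVILEGED_TOOLS:
        return requested_by == "policy_layer"
    return True

def wrap_untrusted(content: str, origin: str) -> dict:
    # Retrieved documents, emails, and web pages enter the context as
    # tagged data, never merged into the instruction channel.
    return {"role": "data", "origin": origin, "content": content}
```

With this separation, a hostile sentence inside a retrieved email can at most influence what the model says, not what the system is authorized to do.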
3.4 Action-Layer Failure: Tool Misuse and Excessive Agency
Tool-use failures occur when an agent chooses the wrong tool, calls the right tool with the wrong arguments, violates schema constraints, or takes an action that exceeds the intended authority of the session. Recent work on schema-first tool APIs [12] finds that interface design materially affects reliability: structured contracts and validation diagnostics reduce malformed calls and downstream error propagation.
This failure layer is partly a software engineering problem. OpenAI's Swarm [13] describes controllable handoffs and tools as primitive abstractions; LangGraph [14] explicitly positions itself as infrastructure for stateful workflows rather than as a guarantee that prompts and policies are correct. In practice, this means agent frameworks create room for reliability, but they do not create reliability automatically.
OWASP's LLM Top 10 [10] addresses the same issue from the security side through categories such as improper output handling and excessive agency. The guiding principle is simple: the more privileges an agent has, the more damaging ordinary model errors become. Least privilege, typed interfaces, dry-run modes, and human review for high-impact actions are therefore not optional niceties but core controls.
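The least-privilege controls above compose into a simple risk-tiered gate, sketched below with illustrative action names and tiers: low-risk actions execute, medium-risk actions are previewed as dry runs, and high-risk actions block on human approval, with unknown actions denied by default.

```python
# Illustrative risk tiers; any action not listed is treated as high risk
# (deny by default).
RISK = {"search": "low", "draft_email": "low",
        "create_event": "medium",
        "send_email": "high", "delete_file": "high"}

def gate(action: str, approved: bool = False) -> str:
    """Return 'execute', 'dry_run', or 'await_approval'."""
    tier = RISK.get(action, "high")
    if tier == "low":
        return "execute"
    if tier == "medium":
        return "dry_run"
    return "execute" if approved else "await_approval"
```

The gate makes the OWASP principle operational: a hallucinated or injected `send_email` call degrades into a pending approval rather than a sent message.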
3.5 Orchestration and Verification Failure in Multi-Agent Systems
Many agent failures emerge only after orchestration is introduced. Cemri and colleagues [15] identify fourteen failure modes across multi-agent systems, grouped into specification and system design failures, inter-agent misalignment, and task verification and termination failures. This result is important because it shows that adding more agents does not automatically add more correctness; it often adds more surfaces for mismatch, looping, and premature termination.
Verification is the decisive weakness. A planner can generate a plausible decomposition, a worker can claim to have finished, and a reviewer can rubber-stamp the result unless the system checks evidence against actual state. Our earlier work [1] makes the same point in a narrower form when it argues that organizationally flavored outputs can only be classified properly relative to pipeline state. The broader lesson is that state, not prose, must be authoritative.
A useful design pattern is to move from conversational coordination to explicit state machines. Agents should not merely tell one another that tasks are complete; they should emit verifiable records, references, artifacts, and test results that downstream components can inspect. Without this, multi-agent systems become generators of mutually reinforcing narratives.
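The state-machine pattern can be sketched as follows, with hypothetical field names. Completion is a transition that requires inspectable evidence, so a worker's claim of "done" carries no weight unless artifacts and passing tests back it up.

```python
# Sketch: a task reaches 'verified' only via evidence, never via prose.
def complete_task(task: dict, evidence: dict) -> dict:
    """Transition a running task to 'verified' or 'failed' based on
    verifiable records rather than an agent's self-report."""
    if task["state"] != "running":
        raise ValueError(f"cannot complete a task in state {task['state']}")
    ok = bool(evidence.get("artifacts")) and evidence.get("tests_passed") is True
    return {**task, "state": "verified" if ok else "failed",
            "evidence": evidence}
```

A reviewer agent that can only call `complete_task` cannot rubber-stamp: without artifacts or test results, the transition lands in `failed` and the narrative loop is broken.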
4 Why "Following the Rules" Breaks Down
The phrase "follow the rules" bundles together at least four separable expectations: obey the system prompt, stay truthful, resist hostile instructions, and use tools within scope. Current agent stacks do not unify those expectations into a single guarantee. Alignment methods influence preferences; prompt templates shape behavior; tool schemas constrain syntax; access controls limit permissions; and verifiers inspect outputs. The system only behaves reliably when these layers cooperate.
This is why a user can observe what looks like willful misbehavior even when no single component has obviously failed. A role prompt may pull the model toward executive-sounding behavior. A retrieved email may contain hostile text that hijacks instruction following. A weak tool wrapper may silently coerce a malformed argument. Another agent may accept the result without checking. The surface symptom is one failure, but the actual root cause is layered.
5 OpenClaw as a Contemporary Case Study
OpenClaw is useful as a case study because its public documentation and news coverage make the privilege problem unusually visible. The official site [16] describes it as "the AI that actually does things," including inbox clearing, email sending, calendar management, and other delegated actions. Its delegate architecture documentation [17] explicitly distinguishes read-only, send-on-behalf, and proactive tiers, and warns that Tier 3 autonomy requires hard blocks before any credentials are granted.
The same documentation also warns that inbound-message prompt injection must be blocked, that plugin installation should be treated as trusted-code execution, and that control-plane tools such as gateway configuration and cron should be denied by default on surfaces handling untrusted content [18]. These warnings are striking because they confirm, in a production-oriented system, the exact layered risks described by the research literature: excessive agency, prompt injection, persistent control-plane changes, and plugin supply-chain exposure.
The public incident reported by The Verge on February 4, 2026 [19] sharpened those concerns. According to the report, researchers found hundreds of malicious skills on the ClawHub marketplace, including add-ons that delivered infostealing malware or attempted to manipulate the user and the agent into running malicious commands. Even if one discounts the most dramatic public commentary, the incident is a concrete demonstration that agent reliability cannot be separated from software supply-chain security.
6 Design Recommendations
Use AI-native role specifications. Replace titles like "CEO" or "manager" with capability-based instructions, event triggers, and machine-readable deliverables.
Make state authoritative. Task status, dependencies, tool outputs, and completion should come from structured state or executed artifacts, not narrative prose.
Separate data from instructions. Treat retrieved documents, emails, web pages, and marketplace skills as untrusted input. Do not let them issue direct commands to privileged tools.
Constrain tools aggressively. Prefer typed schemas, enumerated values, dry-run previews, and deny-by-default policies. High-impact actions should require human approval or policy-engine confirmation.
Verify before acting. Use evaluator steps, test execution, provenance checks, and cross-agent verification before the system sends email, modifies calendars, runs code, or publishes content.
Assume supply-chain risk. Plugins, skills, and prompt files should be reviewed, signed, sandboxed, and permission-scoped as if they were executable code.
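The supply-chain recommendation can be sketched as a deny-by-default vetting check. The manifest fields and trust tiers are assumptions for illustration, not any marketplace's actual schema: an unsigned skill, or one requesting a permission outside its tier's allowlist, is rejected before installation.

```python
# Illustrative permission allowlists per trust tier; unknown tiers get
# an empty allowlist, so their skills are rejected by default.
ALLOWED = {
    "community": {"inbox.read"},
    "reviewed": {"inbox.read", "inbox.label", "calendar.read"},
}

def vet_skill(manifest: dict) -> bool:
    """Install only signed skills whose requested permissions fall
    entirely within the allowlist for their trust tier."""
    if not manifest.get("signed", False):
        return False
    allowed = ALLOWED.get(manifest.get("tier", "community"), set())
    return set(manifest.get("permissions", [])) <= allowed
```

Under this check the February 2026 incident pattern, a marketplace skill quietly requesting shell or control-plane access, fails closed instead of executing with the agent's privileges.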
7 Conclusion
Agent failure is not reducible to a single defect such as "hallucination." The evidence supports a layered picture: role prompts can induce unhelpful behavioral scripts, base models can fabricate unsupported content, prompt injection can replace intended policy with hostile instructions, tool interfaces can translate text errors into real actions, and orchestration can amplify mistakes when agents do not verify one another.
The practical consequence is that robust agents must be engineered, not merely prompted. Systems like OpenClaw make the stakes obvious because they connect language models to consequential tools. In that environment, the right question is not whether the model follows rules in the abstract, but whether the whole stack preserves policy, truth, scope, and verification under realistic conditions.
References
[1] G. Acar, "Human cognitive anti-pattern detection and correction in autonomous AI agent pipeline systems," Mirasys LLC, Mar. 2026.
[2] M. Shanahan, K. McDonell, and L. Reynolds, "Role play with large language models," Nature, vol. 623, pp. 493-498, 2023.
[3] L. Salewski, S. Alaniz, I. Rio-Torto, E. Schulz, and Z. Akata, "In-context impersonation reveals large language models' strengths and biases," in Proc. NeurIPS, 2023.
[4] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, and T. Liu, "A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions," 2023. arXiv:2311.05232.
[5] L. Ouyang et al., "Training language models to follow instructions with human feedback," OpenAI, 2022. arXiv:2203.02155.
[6] Y. Bai et al., "Constitutional AI: harmlessness from AI feedback," Anthropic, 2022.
[7] NVIDIA, "NeMo Guardrails documentation and guardrail catalog," 2026.
[8] Guardrails AI, "Guardrails Hub and hallucination/provenance validators," 2024.
[9] F. Perez and I. Ribeiro, "Ignore previous prompt: attack techniques for language models," 2022. arXiv:2211.09527.
[10] OWASP, "LLM01:2025 Prompt Injection and OWASP Top 10 for Large Language Model Applications," 2025.
[11] X. Fu, S. Li, Z. Wang, Y. Liu, R. K. Gupta, T. Berg-Kirkpatrick, and E. Fernandes, "Imprompter: tricking LLM agents into improper tool use," 2024. arXiv:2410.14923.
[12] A. Sigdel et al., "Schema first tool APIs for LLM agents: a controlled study of reliability under strict budgets," 2026. arXiv:2603.13404.
[13] OpenAI, "Swarm: educational framework exploring ergonomic, lightweight multi-agent orchestration," 2024.
[14] LangChain, "LangGraph overview and framework documentation," 2026.
[15] M. Cemri, M. Z. Pan, S. Yang, L. A. Agrawal, B. Chopra, R. Tiwari, K. Keutzer, A. Parameswaran, D. Klein, K. Ramchandran, M. Zaharia, J. E. Gonzalez, and I. Stoica, "Why do multi-agent LLM systems fail?" 2025. arXiv:2503.13657.
[16] OpenClaw, "OpenClaw product site," 2026. https://openclaw.ai/
[17] OpenClaw, "Delegate architecture documentation," 2026. https://docs.openclaw.ai/concepts/delegate-architecture
[18] OpenClaw, "Gateway security documentation," 2026. https://docs.openclaw.ai/gateway/security
[19] E. Roth, "OpenClaw's AI 'skill' extensions are a security nightmare," The Verge, Feb. 4, 2026.