How AI agents are creating failures no one’s tracking

Untracked AI agent failures are slipping through existing incident frameworks

AI systems are already embedded in the core of enterprise operations. Yet most organizations don’t realize that some of their most disruptive incidents are being triggered by autonomous agents acting within incomplete contexts. These agents take actions that seem correct, based on the data they see, but that same limited vision allows them to amplify small system stresses into cascading failures. When these failures occur, internal teams categorize them as conventional service disruptions. The result is a blind spot: incidents keep happening, and no one connects them back to the AI decisions that triggered them.

This oversight is structural. Enterprise postmortem frameworks were built for human errors or infrastructure faults, not for decisions made by autonomous systems. The frameworks must evolve. Otherwise, leaders are building automation on top of invisible risk. For C-suite executives, the takeaway is simple: governance must catch up to deployment. If an organization cannot clearly classify when and how an AI agent contributes to a failure, it cannot manage operational risk effectively.

Today, the scale of AI adoption makes this problem urgent. Seventy-nine percent of enterprises already run AI agents in production. Ninety-six percent plan to expand usage. Gartner expects one-third of all enterprise software to include agentic AI by 2028 but warns that 40% of these projects will fail due to weak risk controls. Meanwhile, the AI Incidents Database recorded a 21% increase in AI-related incidents from 2024 to 2025. Those numbers highlight a gap between adoption and accountability. Business leaders who address this gap early will not only reduce risk, they’ll also gain a competitive edge through operational resiliency.

AI agents bypass human judgment in chaos experiments, creating unmonitored risk

In mature engineering environments, chaos testing is deliberate. Human engineers check system performance, evaluate service-level objectives (SLOs), and ensure stability before initiating stress tests. This human checkpoint matters because it keeps chaos activities aligned with the system’s ability to absorb risk. When autonomous agents take over operational control, that step disappears. They detect an issue and act instantly, restarting services, scaling infrastructure, or rerouting traffic, without assessing whether the system can absorb additional stress at that moment.

The issue is not that these agents are careless. It’s that they lack the full picture. For example, an agent that restarts a service to fix latency might not know that other systems are under heavy load or that shared dependencies are saturated. In environments running at scale, such an action can create a chain reaction of failures rather than resolving performance issues. The system is left to deal with the fallout from an “optimization” made without human context.

For decision-makers, the risk extends beyond system downtime. This behavior reveals how automation without context introduces new operational liabilities. These failures are often invisible in post-incident reporting because they look like standard technical events. To build sustainable certainty around automated operations, leaders must redefine process ownership. AI agents should never act as isolated entities; their actions must exist within controlled frameworks that include real-time environmental awareness and human oversight when needed. Automation done responsibly is a strength. Done blindly, it becomes a silent liability.

The absence of a shared model for “absorb capacity” undermines resilience management

Most enterprises lack a common understanding of how much stress their systems can handle before breaking performance commitments. This missing concept, called “absorb capacity,” should tell teams how far they can push their systems in real time without crossing failure thresholds. Right now, chaos engineering programs rely largely on static indicators and human intuition to manage this margin. That approach collapses when multiple teams, automation layers, and AI agents act independently across shared dependencies.

The proposed answer is to implement a resilience budget, a live, consumable measure that continuously adjusts based on operational signals. It tracks four critical inputs: how fast SLOs are being consumed, how latency trends are evolving, how saturated system dependencies are, and how users interact with applications under strain. Each agent action or chaos experiment consumes part of this shared budget. Treating resilience as a measurable, limited resource creates accountability and allows coordination across teams.
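A resilience budget of this kind can be pictured as a shared counter that shrinks as the four signals degrade and as actions are charged against it. The following is a minimal sketch under stated assumptions, not a scheme taken from the source: the class name `ResilienceBudget`, the equal weighting of the four signals, their normalization to a 0-to-1 scale, and the example costs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ResilienceBudget:
    """Hypothetical shared budget. Each signal is assumed to be normalized
    to the range 0.0 (healthy) to 1.0 (fully stressed)."""
    slo_burn: float = 0.0               # how fast SLOs are being consumed
    latency_trend: float = 0.0          # how latency is trending
    dependency_saturation: float = 0.0  # how saturated dependencies are
    user_strain: float = 0.0            # how users behave under strain
    spent: float = 0.0                  # budget already consumed by actions
    ledger: list = field(default_factory=list)

    def capacity(self) -> float:
        # Assumption: the four signals are weighted equally.
        stress = (self.slo_burn + self.latency_trend
                  + self.dependency_saturation + self.user_strain) / 4
        return max(0.0, 1.0 - stress)

    def remaining(self) -> float:
        return max(0.0, self.capacity() - self.spent)

    def try_spend(self, actor: str, action: str, cost: float) -> bool:
        """Charge an agent action or chaos experiment against the budget.
        Returns False, and records nothing, if the budget cannot cover it."""
        if cost > self.remaining():
            return False
        self.spent += cost
        self.ledger.append((actor, action, cost))
        return True


budget = ResilienceBudget(slo_burn=0.4, latency_trend=0.2,
                          dependency_saturation=0.6, user_strain=0.0)
first = budget.try_spend("chaos-team", "latency-injection", 0.5)   # True
second = budget.try_spend("remediation-agent", "restart", 0.3)     # False
```

In this sketch the second request is refused because the first one already consumed most of the remaining margin. That refusal is the coordination effect the budget is meant to provide: independent actors see one shared limit rather than each assuming the full margin is theirs.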

For executive leaders, adopting this model translates directly into operational reliability. By giving teams a unified, data-driven view of system stress, the business avoids costly downtime and unplanned disruptions. It also helps connect technical resilience with overall performance outcomes. Structured research with site reliability and platform engineers at Intuit and GPTZero demonstrates that this approach can scale across large organizations. A shared model for absorb capacity makes resilience a quantifiable business asset rather than an assumption hidden in technical metrics.

Large language models (LLMs) enhance chaos hypothesis generation but struggle with incomplete or outdated data

LLMs are increasingly being used to map out potential system failure scenarios based on dependency graphs and historical incident data. These models accelerate discovery by quickly producing hypotheses that teams can test. In early applications, they have surfaced credible failure patterns that would have taken human engineers longer to identify. However, they depend entirely on the quality and currency of the data they process. When dependency graphs become outdated or incomplete, the models confidently produce inaccurate assumptions about system relationships, leading to poor experimental targeting and wasted resources.
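One simple guard against stale inputs is to filter the dependency graph by how recently each relationship was verified before handing it to a model. The sketch below is illustrative only: the edge format, the `last_verified` field, and the 30-day cutoff are assumptions, and real dependency data may carry no such timestamp.

```python
from datetime import datetime, timedelta, timezone


def split_by_freshness(edges, now, max_age=timedelta(days=30)):
    """Separate dependency edges into those verified recently enough to
    feed into hypothesis generation and those that need re-validation.

    Each edge is assumed to be a dict with 'source', 'target', and a
    timezone-aware 'last_verified' datetime (hypothetical schema)."""
    fresh, stale = [], []
    for edge in edges:
        age = now - edge["last_verified"]
        (fresh if age <= max_age else stale).append(edge)
    return fresh, stale


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
edges = [
    {"source": "checkout", "target": "payments",
     "last_verified": datetime(2025, 5, 20, tzinfo=timezone.utc)},
    {"source": "checkout", "target": "legacy-tax",
     "last_verified": datetime(2024, 11, 3, tzinfo=timezone.utc)},
]
fresh, stale = split_by_freshness(edges, now)
# Only `fresh` would be passed to the model; `stale` goes back to the
# owning team for review instead of being silently trusted.
```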

Postmortem-based data offers stronger performance because it reflects real incidents with validated outcomes. This makes hypotheses derived from such sources more dependable. Even with better input data, though, caution is necessary. Stanford’s Trustworthy AI Research Lab found that fine-tuning attacks bypassed the safety measures of top models in most test cases. That means model-level guardrails alone cannot be trusted to fully prevent risky behavior when generating or executing chaos experiments.

The takeaway for executives is clear: LLMs can improve the speed and depth of risk discovery, but they cannot replace disciplined engineering oversight. The models must work from verified data, updated dependency graphs, and established review processes. AI-driven hypothesis generation is useful when bounded by human verification and strict data integrity controls. Without those conditions in place, organizations are introducing uncertainty into the very systems they aim to make more reliable.

In ambiguous contexts, autonomous execution decisions must defer to human judgment

When production systems reach uncertain states, such as after recent deployments, during periods of fluctuating load, or when monitoring data gives conflicting signals, autonomous agents cannot reliably determine the correct course of action. Their decision frameworks operate within bounds defined by observable metrics but fail to account for unrecorded situational factors, such as team availability, contractual obligations, or the timing of other system changes. These gaps create moments where a fully automated response could worsen rather than resolve an issue.

The solution is to introduce structured escalation. When signals are ambiguous, agents should pause and hand control to human operators with broader context and authority. This circuit-breaking mechanism ensures that every decision made under uncertainty includes situational awareness not visible to machines. It also strengthens accountability and aligns with established operational risk frameworks.
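The escalation rule can be sketched as a small decision function that checks for the ambiguous conditions named above and refuses to act when any of them holds. This is a hedged illustration: the `Signals` fields, the `decide` function, and both thresholds are hypothetical, and a real deployment would tune or replace them.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class Signals:
    minutes_since_deploy: float  # time since the last deployment
    load_variation: float        # e.g. relative variation in request rate
    monitors_agree: bool         # False when monitoring sources conflict


def decide(signals, min_settle_minutes=30.0, max_load_variation=0.25):
    """Return (Decision, reasons). Any ambiguous condition hands control
    to a human operator instead of letting the agent act."""
    reasons = []
    if signals.minutes_since_deploy < min_settle_minutes:
        reasons.append("recent deployment")
    if signals.load_variation > max_load_variation:
        reasons.append("fluctuating load")
    if not signals.monitors_agree:
        reasons.append("conflicting monitoring signals")
    decision = Decision.ESCALATE if reasons else Decision.PROCEED
    return decision, reasons


calm = decide(Signals(minutes_since_deploy=120, load_variation=0.1,
                      monitors_agree=True))
unclear = decide(Signals(minutes_since_deploy=5, load_variation=0.4,
                         monitors_agree=False))
```

Returning the reasons alongside the decision gives the human operator the context for the handoff, and it leaves a record of why the agent paused.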

For executives, this is a safeguard that preserves trust in autonomous operation at scale. High-performing organizations separate well-defined, automatable responses from those that demand discretionary judgment. This structure allows businesses to harness the speed of machine response while maintaining the adaptability of human oversight. As technology evolves, these hybrid workflows will remain essential to ensuring that automation acts as a stabilizer in production environments.

Governance frameworks must classify agent actions as chaos events and enforce resilience-based controls

As more enterprises integrate autonomous agents into live infrastructure, governance must operate at the same level of precision used in chaos engineering. Every decision made by an agent, whether restarting a service, scaling resources, or rerouting data, must be tracked, analyzed, and constrained by the same operational signals that guide human-led experiments. This ensures that agent actions only proceed when system conditions allow for safe execution.

Organizations that embed resilience controls directly into their agent governance frameworks can continuously evaluate the impact of autonomous actions. By registering each agent action against live SLO burn rates, latency trends, and dependency states, leaders gain visibility into both local effects and system-wide consequences. Treating agent activity as structured experiments, rather than simple events, allows post-incident data to feed forward into future decision-making, improving both the agents and the systems they manage.
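Registering an agent action against live signals might look like the sketch below, which gates the action and writes a structured record whether or not it was allowed. The event schema, the snapshot keys, the function name `gate_and_record`, and the thresholds are assumptions made for illustration; they do not come from the source or from any particular tool.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class AgentActionEvent:
    """Hypothetical audit record: one entry per attempted agent action."""
    agent_id: str
    action: str
    target: str
    slo_burn_rate: float
    latency_trend_ms_per_min: float
    saturated_dependencies: list
    allowed: bool
    reason: str


def gate_and_record(agent_id, action, target, snapshot, audit_log,
                    max_burn_rate=1.0, max_latency_trend=0.0):
    """Check the action against a snapshot of live signals, then append a
    JSON record to audit_log regardless of the outcome."""
    if snapshot["slo_burn_rate"] > max_burn_rate:
        reason = "SLO burn rate above limit"
    elif snapshot["latency_trend_ms_per_min"] > max_latency_trend:
        reason = "latency trending upward"
    elif snapshot["saturated_dependencies"]:
        reason = ("saturated dependencies: "
                  + ", ".join(snapshot["saturated_dependencies"]))
    else:
        reason = ""
    event = AgentActionEvent(
        agent_id=agent_id, action=action, target=target,
        slo_burn_rate=snapshot["slo_burn_rate"],
        latency_trend_ms_per_min=snapshot["latency_trend_ms_per_min"],
        saturated_dependencies=list(snapshot["saturated_dependencies"]),
        allowed=(reason == ""), reason=reason,
    )
    audit_log.append(json.dumps(asdict(event)))
    return event


audit_log = []
healthy = {"slo_burn_rate": 0.5, "latency_trend_ms_per_min": -1.0,
           "saturated_dependencies": []}
strained = {"slo_burn_rate": 0.5, "latency_trend_ms_per_min": -1.0,
            "saturated_dependencies": ["orders-db"]}
ok = gate_and_record("agent-7", "restart", "checkout", healthy, audit_log)
blocked = gate_and_record("agent-7", "restart", "checkout", strained, audit_log)
```

Logging blocked attempts as well as permitted ones is deliberate in this sketch. A postmortem can then show what an agent tried to do and under which conditions, which is the link back to agent decisions that conventional incident records lack.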

For business leaders, this approach represents operational maturity. It builds transparency, establishes a measurable standard for safe automation, and converts AI activity into analyzable operational data. Enterprises that perform regular audits of active agents and map them against live resilience metrics will uncover high-risk automation running outside approved guardrails. Bringing those systems under governance control reduces exposure and builds executive confidence. Stable, monitored autonomy is not a future ideal; it is a current operational necessity that determines whether enterprises scale AI successfully or fail to control it.

Key highlights

Untracked AI agent failures require immediate oversight: Most enterprises don’t recognize when AI agents trigger cascading infrastructure issues. Leaders should update incident frameworks to identify agent-driven events and make accountability measurable across operations.
Automation without context increases operational risk: AI agents often act without the human judgment that stabilizes chaos experiments. Executives should ensure every autonomous action is evaluated against live system health before execution.
Resilience budgets turn chaos into measurable control: The absence of a shared “absorb capacity” model leaves systems vulnerable to overload. Leaders should implement resilience budgets that quantify and limit the stress an environment can safely handle.
LLMs accelerate risk discovery but need verified data: Large language models generate useful hypotheses for testing but rely on accurate system data. Executives should pair these tools with frequent dependency updates and strict human validation.
Human judgment must intervene under uncertainty: Fully autonomous decisions in ambiguous conditions lead to avoidable failures. Leaders should mandate escalation to human operators whenever signals are incomplete or conflicting.
Agent governance must mirror chaos engineering discipline: Every AI agent action should be treated as a controlled experiment governed by real-time resilience metrics. Leaders should enforce policy frameworks that log, review, and gate agent actions based on system stability.