Automation transforms human work rather than simply substituting for it
There’s a dominant narrative in the tech world that automation equals efficiency because software can just take over what humans do. That’s massively oversimplified, and can be dangerous if you’re making decisions based on it. The idea that machines can simply “replace” human functions by doing the same tasks faster or with fewer errors misses the point. When you bring automation into a system, you don’t just duplicate the work in digital form. You reshape how the system operates. You change the scope and structure of human involvement.
This misconception is what researchers Sidney Dekker and David Woods call the “substitution myth.” Their 2002 study makes it clear: allocating tasks based on the idea that “humans are better at this, machines at that” ignores how those allocations actually transform the job environment. That transformation brings new, often unpredictable tasks. These aren’t just more of the same, they’re different in kind. System resilience depends on how well people understand and adapt to this shifting environment, and if you cut them out of the system thinking it’s all been solved by automation, you’re setting yourself up for blind spots you won’t notice until things go wrong.
For anyone leading a software-driven organization, the takeaway is simple but important: automation doesn’t eliminate the need for people, it changes what they need to focus on. That shift in focus can go unnoticed until there’s a failure. If you’re not designing with this reality in mind, you’re not designing for real-world operations.
Automation’s dual role in incident response
Automation’s role in software operations isn’t straightforward. Sometimes it prevents failures or surfaces them early. Other times, it causes incidents and makes them worse. This is what makes automation a complicated partner, it’s not always working in one direction. It needs oversight. A system that auto-corrects or scales is great until it reacts the wrong way and locks your team out of the very tools they need to recover.
The 2021 Facebook outage is a painful example. A routine maintenance command, which an automated audit tool failed to block, severed the backbone connections linking Facebook’s data centers globally. Automation then compounded the damage: when the DNS servers lost contact with the data centers, they withdrew their routes, making Facebook’s servers unreachable. It didn’t stop there. The outage also took down the internal tools and badge systems employees needed to enter the data centers, putting secure access out of reach and delaying physical recovery. Everything worked exactly as programmed, it just didn’t account for that scenario. And that’s the problem.
If you’re leading operations or product at an enterprise scale, this should matter to you. It’s tempting to treat automation as a risk reducer. Often, it is. But without feedback loops and visibility built in, automation becomes brittle in crisis. It can create dead ends instead of options. Smart leaders recognize the duality here, automation boosts performance only when human response is part of the design.
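To make that concrete, here is a minimal sketch, in Python, of what keeping human response in the design can look like for auto-remediation: the automation rate-limits itself and honors a kill switch that lives outside the system it manages, so an incident never locks operators out of the off button. The file path, thresholds, and helper functions are illustrative assumptions, not any particular platform’s interface.

```python
import time
from pathlib import Path

# Hypothetical kill switch: a plain file on local disk, deliberately outside the
# system the automation manages, so operators can still disable it mid-incident.
KILL_SWITCH = Path("/etc/automation/disable_auto_remediation")

def page_oncall(message: str) -> None:
    print(f"[PAGE] {message}")  # stand-in for a real paging integration

def restart_service(service: str) -> None:
    print(f"[ACTION] restarting {service}")  # stand-in for the real remediation step

class RemediationGuard:
    """Wraps an automated action with a rate limit and a manual override."""

    def __init__(self, max_actions: int = 3, window_seconds: float = 600.0):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.recent_actions: list[float] = []

    def allowed(self) -> bool:
        if KILL_SWITCH.exists():
            return False  # a human has explicitly taken over
        now = time.monotonic()
        self.recent_actions = [t for t in self.recent_actions
                               if now - t < self.window_seconds]
        # Repeated interventions in a short window suggest the automation is
        # fighting a problem it doesn't understand: stop and hand off to a human.
        return len(self.recent_actions) < self.max_actions

    def record(self) -> None:
        self.recent_actions.append(time.monotonic())

def auto_restart(service: str, guard: RemediationGuard) -> None:
    if guard.allowed():
        restart_service(service)
        guard.record()
    else:
        page_oncall(f"Auto-restart of {service} suppressed; manual action needed")
```

The design choice worth copying isn’t the specific numbers, it’s that the escape hatch doesn’t depend on the thing that is failing.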
Designing for ideal outcomes can overlook critical failure modes
Most automation is built on assumptions, primarily the assumption that things will work as planned. Design teams often focus on optimizing for expected outcomes like speed, accuracy, and reduced manual workload. That’s fine when everything is going right. The problem is, that kind of design thinking tends to ignore the fact that systems don’t always operate under expected conditions. And when automation goes off course, it doesn’t wind down gracefully. It amplifies the error.
This is particularly true in deployment environments like CI/CD pipelines. These systems are meant to push changes fast and automatically. But if a misconfiguration slips in, and it will, it doesn’t just cause a small issue. The change is deployed automatically and system-wide. Without clear indicators or fail-safes, entire platforms can be brought down in seconds, and no one sees it coming until it’s already in motion.
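One common fail-safe is a canary gate: roll the change out to a small slice first, compare it against what’s already running, and halt the pipeline automatically if the canary looks worse. The sketch below assumes hypothetical metric and pipeline helpers; the thresholds and names are illustrative, not a specific CI/CD product’s API.

```python
def canary_gate(canary_error_rate: float,
                baseline_error_rate: float,
                max_absolute: float = 0.02,
                max_relative: float = 2.0) -> bool:
    """Return True only if the canary looks healthy enough to promote."""
    if canary_error_rate > max_absolute:
        return False  # hard ceiling, regardless of what the baseline is doing
    if baseline_error_rate > 0 and canary_error_rate > max_relative * baseline_error_rate:
        return False  # sharp regression relative to what's already running
    return True

# Stand-ins for the real pipeline, metrics, and paging integrations:
def deploy_to_canary(version: str) -> None:
    print(f"[canary] deploying {version} to a small traffic slice")

def observe_error_rates(minutes: int) -> tuple[float, float]:
    return 0.004, 0.003  # (canary error rate, baseline error rate)

def promote_to_fleet(version: str) -> None:
    print(f"[fleet] promoting {version} everywhere")

def rollback_canary(version: str) -> None:
    print(f"[canary] rolling back {version}")

def page_oncall(message: str) -> None:
    print(f"[page] {message}")

def deploy(new_version: str) -> None:
    deploy_to_canary(new_version)
    canary, baseline = observe_error_rates(minutes=15)
    if canary_gate(canary, baseline):
        promote_to_fleet(new_version)
    else:
        rollback_canary(new_version)
        page_oncall(f"Canary for {new_version} failed the gate; promotion halted")
```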
If you’re an executive approving new tooling, you need to ask: “Has failure been modeled?” Not just hypothetically, but in terms of real impact and team response. Assume the automation will fail eventually, that’s not pessimism, it’s operational realism. Design for how the team will respond when that happens. Don’t just measure potential efficiency gains, measure resilience.
Automation may lead to deskilling and reduced situational awareness
Automation can reduce routine work, but it comes with trade-offs. As systems handle more processes without human input, operators observe more and do less. That may sound efficient, but long-term, it weakens understanding. When a person monitors a system passively instead of working with it directly, they lose the tactile, real-time feedback that builds expertise. So when failures happen, they’re less equipped to intervene.
Systems don’t just need oversight, they need informed oversight. That comes from people having exposure to how those systems behave under different conditions, not just when they’re running smoothly. Without that exposure, knowledge becomes concentrated in the hands of a few experts. When one of them is unavailable, the knowledge gap becomes operational risk. And it slows recovery.
If you’re running global infrastructure teams, this should be at the top of your list. Relying on automation without serious investment in skill-building leads to fragile operations. Operators can’t respond effectively to what they don’t fully understand. If your team is staring at dashboards with no clarity on what they’re seeing, the system isn’t really safe, it’s just quiet. Retain competence by rotating people through hands-on work and ensuring tools support active learning. The more people understand your systems end to end, the stronger your incident response becomes.
Automation failures generate new, unanticipated human tasks
When automation breaks, it rarely hands you a clean task list to fix it. Instead, it creates new challenges, often under pressure and typically undocumented. The work humans need to do in these moments is more complex than what’s required during normal operations. They’re not just correcting a failed input, they’re navigating an unfamiliar situation created by the automation.
Most automated systems are designed around predictable workflows. But when things go wrong, those systems don’t fall back gracefully. They create confusion. Now the people in charge are forced to investigate how and why multiple automated steps led to failure. This increase in cognitive load slows everything down and raises error risk in recovery. It’s also more expensive, both in time and effort.
If you’re leading Ops or Engineering, don’t treat automation as a guaranteed simplifier. Plan for its edge cases. Let your teams simulate how automation breaks. Build tooling that helps surface context fast. Otherwise, your most experienced engineers will spend more time reverse-engineering what the automation did than actually fixing the issue. Design automation that’s not just autonomous, but interruptible, inspectable, and reversible when needed.
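As a hedged sketch of what “interruptible, inspectable, and reversible” can mean in code, the hypothetical wrapper below pauses when an operator asks it to, records an audit entry for every step, supports a dry run, and attempts its own undo before handing control back to a human. The names are assumptions for illustration, not an existing library.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable

@dataclass
class AuditEntry:
    timestamp: str
    step: str
    detail: str

@dataclass
class Runbook:
    """Keeps automated changes interruptible (an operator can pause the sequence),
    inspectable (every step leaves an audit entry), and reversible (each step
    carries its own undo)."""
    audit_log: list[AuditEntry] = field(default_factory=list)
    paused: bool = False

    def _log(self, step: str, detail: str) -> None:
        self.audit_log.append(
            AuditEntry(datetime.now(timezone.utc).isoformat(), step, detail))

    def run_step(self, name: str, apply: Callable[[], None],
                 undo: Callable[[], None], dry_run: bool = False) -> None:
        if self.paused:
            self._log(name, "skipped: runbook paused by an operator")
            return
        if dry_run:
            self._log(name, "dry-run only: no changes applied")
            return
        try:
            apply()
            self._log(name, "applied")
        except Exception as exc:
            # Surface what happened, attempt the undo, then stop and hand over.
            self._log(name, f"failed: {exc}; undo attempted, runbook paused")
            undo()
            self.paused = True
```

Mid-incident, an engineer can read the audit log to see what the automation already did, instead of reverse-engineering it from side effects.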
Reduced transparency in automated systems hinders effective debugging
As automation becomes more layered, visibility into what it’s doing becomes a serious problem. To diagnose failures in these systems, teams need to understand not only the application logic but also the logic behind the automation, including how it responds to inputs, what triggers it, and why it took a particular action. That’s not a trivial ask, especially at scale.
This kind of complexity often results in “expert silos.” A single engineer, usually the one who built or ran the system, knows what’s going on. Everyone else stares at unfamiliar logs or dashboards without any real comprehension. When that engineer is out, on vacation, or otherwise unavailable, the team is blocked. That’s operational fragility hiding inside technical design complexity.
If you’re overseeing systems at the enterprise level, your automation needs observability built-in from day one. Not just telemetry, but explainability. Make sure the system can tell humans what it’s doing and why. And invest in cross-training. Don’t let critical system knowledge reside in one person’s brain. A resilient organization doesn’t just document, it builds shared understanding across teams. The risk isn’t just downtime, it’s decision-making slowdown when key people are absent.
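One lightweight way to build that explainability in is to have the automation emit a structured decision record every time it acts: what it saw, which rule fired, what it did, and how a human can reverse it. The field names and example below are an illustrative convention assumed for this sketch, not a standard schema.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("automation.decisions")

def record_decision(trigger: str, observation: dict, rule: str,
                    action: str, reversal_hint: str) -> None:
    """Emit a structured, human-readable record of why the automation acted."""
    logger.info(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "trigger": trigger,              # what event woke the automation up
        "observation": observation,      # the inputs the decision was based on
        "rule": rule,                    # which policy or threshold fired
        "action": action,                # what it actually did
        "reversal_hint": reversal_hint,  # how a human can undo or override it
    }))

# The kind of record an engineer should be able to find mid-incident:
record_decision(
    trigger="cpu_pressure_alert",
    observation={"service": "checkout", "cpu_p95": 0.93, "replicas": 12},
    rule="scale_up_when_cpu_p95_over_0.85",
    action="scaled checkout from 12 to 18 replicas",
    reversal_hint="scale checkout back to 12 replicas or disable the scaling policy",
)
```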
Joint cognitive systems (JCS) offer a more resilient integration of automation
If we want automation to be more effective, we need to stop designing it to simply “do a job” in isolation. The team is not just the humans, and it’s not just the systems. The most capable operations come from integrating both as a unit. That’s where the concept of Joint Cognitive Systems comes in. It’s a framework where humans and machines work together through shared context, mutual goals, and clear coordination, not isolated task execution.
Gary Klein and his research team have laid out the principles. These include mutual predictability (knowing what the other side is likely to do), mutual directability (being able to shift the other’s actions when needed), and common ground (shared understanding of what’s happening and what matters). These are not abstract ideas, they’re requirements for resilient operations. Without this kind of alignment, your automation decisions can block or mislead your human teams when things get tough.
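As a deliberately small, hedged reading of those principles (not an implementation from the research), the sketch below has the automation announce its intent and wait out a veto window before acting, which gives the team predictability and a point of directability. The names and timings are illustrative.

```python
import threading
from typing import Callable

class ProposedAction:
    """Before acting, the automation announces its intent (mutual predictability)
    and gives the team a window to veto or redirect it (mutual directability)."""

    def __init__(self, description: str, execute: Callable[[], None],
                 veto_window_seconds: float = 120.0):
        self.description = description
        self.execute = execute
        self.veto_window_seconds = veto_window_seconds
        self._vetoed = threading.Event()

    def veto(self) -> None:
        # A human (or a system acting on their behalf) can call this at any
        # point during the window to stop the action.
        self._vetoed.set()

    def run(self, notify: Callable[[str], None]) -> None:
        notify(f"Automation intends to: {self.description} "
               f"(proceeding in {self.veto_window_seconds:.0f}s unless vetoed)")
        was_vetoed = self._vetoed.wait(timeout=self.veto_window_seconds)
        if was_vetoed:
            notify(f"Standing down: {self.description} was vetoed by an operator")
        else:
            self.execute()
            notify(f"Completed: {self.description}")
```

An operator calls veto() during the window to redirect; otherwise the action proceeds and its completion is announced, so nobody is surprised either way.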
If you’re leading a company that relies increasingly on machines to manage infrastructure, data, deployment, or monitoring, pay attention to how those machines coordinate with your people. Do they show what they’re doing? Do they respond to changing goals on the fly? Do they help your engineers understand the system state, or make it harder? These are questions about system usability, yes, but more importantly, about system trust. If you want teams to truly move fast and recover fast, you have to build technology that supports and enhances how real people do hard work under pressure.
Neglecting human expertise in AI design exacerbates operational risks
There’s a growing belief that “better AI” will solve most software operations challenges. That’s wrong. It only solves the right problems if the people who design and implement AI deeply understand how human teams actually operate. If that understanding is missing, the AI systems not only fail more often, they fail bigger, and faster.
Human expertise is not a plug-in. It emerges through experience, context, accumulated judgment, and the ability to adapt under stress. Most AI systems aren’t designed to support that. They’re not built to augment the situational awareness of engineers. They’re not built to explain themselves or accept correction when they’re wrong. And when they lack those capabilities, they push teams into high-risk states just when steadiness is needed most.
If you’re assuming a software platform will eventually run itself, you’re headed for expensive problems. Autonomous systems will still require human clarity, intervention paths, and shared operational models. If AI tools can’t contribute to team coordination or decision-making in real time, then those tools won’t scale value, or resilience. They’ll just shift the load.
Build systems that respect how human expertise works. Train AI and automation to work with, not around, people. The faster you align autonomous action with practical human coordination, the more durable your infrastructure becomes.
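As one hedged illustration of working with people rather than around them, the sketch below has an AI-driven remediation surface its proposal and rationale, require an explicit human decision instead of acting silently, and record any correction so it becomes tuning signal rather than lost context. Every name here is an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Proposal:
    action: str        # what the AI wants to do
    rationale: str     # why, stated in terms an engineer can check
    confidence: float  # the model's own estimate, surfaced rather than hidden

@dataclass
class Decision:
    approved: bool
    correction: Optional[str] = None  # what the human changed, or why they declined

def run_with_human(proposal: Proposal,
                   ask_human: Callable[[Proposal], Decision],
                   execute: Callable[[str], None],
                   record_feedback: Callable[[Proposal, Decision], None]) -> None:
    """Never act silently: surface the proposal, accept correction, keep the trace."""
    decision = ask_human(proposal)
    record_feedback(proposal, decision)  # corrections feed later tuning and review
    if decision.approved:
        execute(decision.correction or proposal.action)

# Minimal demo with stand-in callables:
run_with_human(
    Proposal(action="restart the payments worker pool",
             rationale="queue depth rising and p99 latency above SLO for 10 minutes",
             confidence=0.72),
    ask_human=lambda p: Decision(approved=True, correction="restart one replica first"),
    execute=lambda action: print(f"[ACTION] {action}"),
    record_feedback=lambda p, d: print(f"[FEEDBACK] approved={d.approved}, correction={d.correction}"),
)
```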
In conclusion
If you’re betting on automation to unlock scale, speed, or efficiency, you’re not wrong. But if you’re assuming it reduces complexity on its own, you’re missing the bigger picture. Automation changes how your teams work, where they focus, and how they respond under pressure. It’s not just about what gets automated, it’s about what gets redefined, and who still needs to think clearly when things go sideways.
Systems don’t run themselves. At least not in a way that’s safe, sustainable, or scalable without human expertise tightly built into the loop. Whether it’s AI, CI/CD, or operational tooling, automation won’t solve human blind spots, it will expand them unless you address how people and systems align in real environments.
Leaders need to go beyond performance metrics and think in terms of resilience. That includes investing in tooling that’s transparent, teams that are cross-functional, and systems that assume the unexpected. Automation isn’t just a product of software, it’s a reflection of how you build teams, coordinate knowledge, and plan for pressure.
The payoff is big if you get this right. But it starts with a clear view of what automation really does, and what your people still need to do when it matters most.


