AI can handle Python just fine but struggles almost everywhere else

Large language models (LLMs) are error-prone and unreliable for intricate, multi-step work tasks

The recent Microsoft preprint study, LLMs Corrupt Your Documents When You Delegate, cuts through the hype surrounding AI capability. The results are clear: today’s large language models still make serious, compounding mistakes when handling complex workflows. The research team, led by Philippe Laban, Tobias Schnabel, and Jennifer Neville, used a benchmark called DELEGATE‑52, which simulates real-world work environments across 52 professional domains. It tested how 19 different LLMs handled long, multi-step tasks, precisely the kind of operations knowledge workers perform daily.

The results show that LLMs tend to degrade documents when they repeatedly edit or refine them. Frontier models such as Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 lost around 25% of document content after just 20 delegated interactions. Across all tested models, the average degradation climbed to about 50%. That’s a dramatic level of loss, especially for enterprises that depend on document precision in areas like contracts, codebases, or compliance material.
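To put those figures in per-interaction terms, here is a rough back-of-the-envelope sketch in Python. It assumes the loss compounds at a constant rate across the 20 interactions, which is a simplification for illustration and not something the study states. The function name is hypothetical.

```python
def per_step_retention(total_loss: float, steps: int) -> float:
    """Fraction of content kept per interaction, ASSUMING a constant
    compounding rate (an illustrative simplification, not a study result)."""
    if not 0.0 <= total_loss < 1.0:
        raise ValueError("total_loss must be in [0, 1)")
    if steps <= 0:
        raise ValueError("steps must be positive")
    # If r is kept per step, then r ** steps == 1 - total_loss.
    return (1.0 - total_loss) ** (1.0 / steps)


# Figures quoted in the article: ~25% loss (frontier models) and
# ~50% loss (average across all models) after 20 delegated interactions.
for label, loss in [("frontier", 0.25), ("all-model average", 0.50)]:
    kept = per_step_retention(loss, 20)
    print(f"{label}: about {1 - kept:.2%} of content lost per interaction")
```

Under that assumption, the quoted losses work out to roughly 1.4% and 3.4% per interaction. Each step looks harmless on its own, which is why the damage is easy to miss until it has accumulated.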

For executives, this should not be discouraging but instructive. The message isn’t that AI systems fail; it’s that their current design isn’t yet reliable enough for fully autonomous workflows. Delegating too much to an unverified model is like trusting an intern to manage your key data pipelines alone: it doesn’t end well. These findings underscore that precision still needs human supervision and automated checks. Enterprise automation is powerful, but accuracy in decision-critical documents remains non-negotiable.

When evaluating AI tools, focus on measurable reliability. The strongest companies will deploy LLMs in structured environments, where transparency, tracking, and revision integrity are built in. With tighter frameworks and smarter validation processes, the technology can close this reliability gap. Until then, a balanced approach, one that combines AI utility with disciplined oversight, wins every time.

Python programming is the only domain where most LLMs show readiness

The same Microsoft study revealed something important: current AI systems perform reliably only in specific, structured domains. Among the 52 domains tested, Python programming stood out as the single area where most large language models produced consistent, accurate work. Even the best model delivered competent performance in just 11 of the 52 domains, meaning broad readiness simply doesn’t exist yet.

Sanchit Vir Gogia, Chief Analyst at Greyhound Research, highlighted this gap as a clear signal that enterprises must be selective: success depends on matching the technological strengths of LLMs with the right use cases. In practical terms, that means automating narrow, clearly defined tasks such as code suggestions or syntax review before trusting these systems with more abstract or nuanced work like legal writing, accounting summaries, or archival data management.

For leaders, the strategic takeaway is prioritization. Deploy AI where it demonstrably works and avoid forcing it where the risk of corruption or context loss is high. Models excel in highly structured and rule-based environments like programming, but they falter in contexts that require subtle interpretation, judgment, or cross-domain reasoning. The maturity gap between domains is wide, and understanding it prevents costly implementation mistakes.

Executives should view this as a planning advantage rather than a limitation. By aligning technology adoption with domain readiness, businesses can incrementally scale their AI strategies, starting with stable, verifiable tasks and gradually expanding toward more complex operations. That’s how you build dependable automation: one validated domain at a time.

AI’s primary challenge lies in preserving document integrity during repeated revisions rather than in content creation

The findings from Microsoft’s LLMs Corrupt Your Documents When You Delegate study expose an important truth about the current generation of AI models. While they’re exceptional at generating fresh content, they struggle to preserve accuracy as tasks become iterative. Every edit introduces small inconsistencies, which multiply over time. The issue grows worse as document size, complexity, and context noise increase, exactly the conditions found in real enterprise environments.

Sanchit Vir Gogia, Chief Analyst at Greyhound Research, emphasizes that the problem is not the much-discussed “hallucination” effect, but something more fundamental: the failure to maintain artefact integrity. LLMs often replace precision with surface‐level plausibility, resulting in subtle corruption of facts, structure, or logic. This makes AI output look fine at a glance but unreliable upon close examination.

For executives, this insight matters because it defines where risk lives in automation. Most organizations deal with documents that evolve over multiple iterations: contracts, compliance records, technical briefs, or policy drafts. If the underlying system can’t maintain original meaning or accuracy consistently, decision-making that depends on these records can be compromised.

Decision-makers should require transparency metrics for AI workflows, meaning systems that surface degradation or content drift as it occurs. Enterprises must invest in control layers that monitor output consistency over time. As AI continues to mature, preserving document integrity will become a top performance benchmark, alongside creativity and logical accuracy.
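One simple way to picture such a drift metric is to compare each revision against the original and flag the point where similarity falls below a threshold. The sketch below uses Python's standard `difflib` module. The function names and the 0.9 threshold are illustrative assumptions, and this is not the metric used in the Microsoft study.

```python
from difflib import SequenceMatcher


def similarity(original: str, revised: str) -> float:
    """Word-level similarity between two versions, from 0.0 to 1.0."""
    matcher = SequenceMatcher(None, original.split(), revised.split(), autojunk=False)
    return matcher.ratio()


def drift_report(revisions: list[str], threshold: float = 0.9) -> list[dict]:
    """Score every revision against the first version and flag heavy drift.

    The threshold is a hypothetical default; a real deployment would tune it
    per document type.
    """
    baseline = revisions[0]
    report = []
    for step, text in enumerate(revisions[1:], start=1):
        score = similarity(baseline, text)
        report.append({"step": step, "similarity": round(score, 3), "flagged": score < threshold})
    return report


if __name__ == "__main__":
    history = [
        "The supplier shall deliver goods within 30 days of the order date.",
        "The supplier shall deliver the goods within 30 days of the order date.",
        "The supplier delivers goods promptly after the order.",
    ]
    for row in drift_report(history):
        print(row)
```

A raw similarity score also penalizes edits that were requested on purpose, so a production system would more plausibly track only the content that was supposed to stay unchanged.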

Delegated AI is not yet dependable enough for fully autonomous enterprise operations

The current level of AI capability does not yet support full autonomy in enterprise environments. According to Microsoft’s research, even advanced models make compounding errors that silently damage critical documents. While AI can assist in automating workflows, trusting an LLM to operate without oversight exposes organizations to risks affecting accuracy, compliance, and data integrity.

Brian Jackson, Principal Research Director at Info‑Tech Research Group, highlighted that automation systems need safeguards. He noted that successful deployments require “stronger guardrails”: processes that verify, correct, and validate AI output before it reaches systems of record. Instead of complete automation, businesses should design processes where AI and humans share accountability: AI handles routine execution, while human reviewers ensure reliability and contextual accuracy.

For executives overseeing digital transformation, the takeaway is clear: autonomous AI is not yet enterprise‑ready without disciplined architecture supporting it. The current generation of models excels when operating inside frameworks that constrain and validate their actions. Such frameworks should define the limits of autonomy, include multi‑layer output checks, and maintain traceability from input to result.
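As a rough illustration of those three properties (a limit on autonomy, layered output checks, and traceability), the sketch below wraps a document in a small gatekeeper class. Every name here, including `GuardedDocument`, the two checks, and the edit limit, is hypothetical. It is a minimal sketch of the idea, not a design described by the study or the analysts quoted.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable

# A check receives (before, after) and returns True when the edit is acceptable.
Check = Callable[[str, str], bool]


def _digest(text: str) -> str:
    """Short content hash so the audit trail can tie each input to its result."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def not_blank(before: str, after: str) -> bool:
    return bool(after.strip())


def no_large_deletion(before: str, after: str) -> bool:
    # Illustrative rule: reject edits that shrink the document by more than 10%.
    return len(after) >= 0.9 * len(before)


@dataclass
class GuardedDocument:
    text: str
    checks: list[Check]
    max_autonomous_edits: int = 5  # assumed limit of autonomy
    audit_log: list[dict] = field(default_factory=list)
    edits_since_review: int = 0

    def propose(self, new_text: str, source: str) -> str:
        """Return 'committed', 'rejected', or 'needs_human_review'."""
        if self.edits_since_review >= self.max_autonomous_edits:
            status = "needs_human_review"
        elif all(check(self.text, new_text) for check in self.checks):
            status = "committed"
        else:
            status = "rejected"
        self.audit_log.append({"source": source, "before": _digest(self.text),
                               "after": _digest(new_text), "status": status})
        if status == "committed":
            self.text = new_text
            self.edits_since_review += 1
        return status

    def human_review(self, reviewer: str) -> None:
        """Record a human sign-off and reset the autonomy counter."""
        self.audit_log.append({"source": reviewer, "before": _digest(self.text),
                               "after": _digest(self.text), "status": "human_reviewed"})
        self.edits_since_review = 0
```

The point of the design is that the model never writes to the record directly: every proposed edit passes through the checks, leaves a log entry, and counts toward a budget that only a human reviewer can reset.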

Organizations that depend on zero‑error output, such as those managing financial ledgers, policies, or compliance data, must maintain human checkpoints until AI systems demonstrate provable resilience over lengthy, complex interactions. Businesses can still achieve operational gains by combining supervised AI deployment with process redesign, ensuring consistency without surrendering control.

Effective mitigation of AI errors requires enhanced structural approaches

The Microsoft study makes it evident that reliable AI integration requires intentional design, not blind trust. Enterprises can reduce model degradation by improving the supporting structures around AI workflows. This includes advanced fine-tuning using domain-specific data, rigorous testing protocols, and intelligent verification systems that cross-check model outputs. A well‑designed feedback loop is essential to ensure stability across repeated interactions.

Brian Jackson, Principal Research Director at Info‑Tech Research Group, pointed out that multi‑agent systems, where one model performs a task and another reviews it, can reduce some risks, but a poor setup can amplify degradation instead. The Microsoft paper found that a flawed multi‑agent configuration created more errors than a single model operating alone. Consequently, these architectures must be built with precise verification logic rather than assumed to self‑correct.

For enterprise leaders, the message is concrete: customization is critical. Models trained broadly across many disciplines do not automatically perform well on specialized corporate tasks. Applying internal training data sharpens accuracy and helps the AI adhere to business‑specific language, workflows, and priorities. Platforms that support deterministic verification (mathematical methods to confirm the accuracy of results) should also be prioritized for sensitive domains like finance, healthcare, and compliance.
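The article does not say which deterministic methods it has in mind, so the following is only one illustrative possibility: a rule-based check that every numeric figure in the original document still appears in the edited version. The function names and the number pattern are assumptions made for this sketch.

```python
import re
from collections import Counter

# Matches figures such as 30, 1,250 or 1,250.00 (illustrative pattern only).
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def numeric_facts(text: str) -> Counter:
    """Count each numeric figure in the text, ignoring thousands separators."""
    return Counter(match.replace(",", "") for match in NUMBER.findall(text))


def missing_numbers(original: str, edited: str) -> list[str]:
    """Return the figures present in the original but lost in the edit.

    The same input always yields the same answer, which is what makes this
    check deterministic, unlike asking a second model for its opinion.
    """
    lost = numeric_facts(original) - numeric_facts(edited)
    return sorted(lost.elements())


if __name__ == "__main__":
    original = "Invoice total is 1,250.00 USD, due in 30 days."
    edited = "The invoice total is 1,250.00 USD, payable within 45 days."
    print(missing_numbers(original, edited))
```

In the usage example, the edit silently changed the payment term, so the check reports the lost figure. A check like this catches only one class of corruption, but it does so reliably and cheaply, which makes it a reasonable first layer for finance or compliance documents.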

Executives should view verification as a core operational function, not an optional safeguard. Establishing structured validation layers, regular model audits, and retraining cycles anchored in measurable performance metrics will make AI outputs dependable and explainable. With this approach, the enterprise can achieve more automation capacity without compromising control or credibility.

Human expertise remains indispensable as AI shifts from a production role to one of supervision and accountability

The study makes a realistic point that should resonate with business leaders: the more advanced the AI becomes, the greater the demand for human oversight. As AI takes over parts of knowledge work, human roles evolve rather than disappear. Skilled professionals are essential for monitoring, validating, and holding AI systems accountable for their outputs. This shift moves humans from creation to supervision but increases the value of specialized knowledge within enterprise processes.

Sanchit Vir Gogia, Chief Analyst at Greyhound Research, noted that the individuals best equipped to detect AI‑introduced errors are often the same experts organizations consider replacing. When companies reduce domain knowledge too far, they remove the last safeguard capable of identifying subtle data corruption or logical drift. Even frontier models sometimes alter facts, tone, or context in ways that are difficult to detect without deep understanding of the topic.

For executives, this insight signals the need to reframe workforce strategies. Redundancy goals should give way to capability reinforcement. Enterprises should invest in upskilling employees who can perform technical validation, contextual review, and compliance monitoring on AI‑supported outputs. These adjusted roles ensure that governance and accountability remain within human judgment, protecting brand trust and operational reliability.

Forward‑thinking leaders will integrate AI governance directly into their organizational design, combining model performance tracking with human oversight. This level of structured accountability ensures consistent quality without slowing innovation. The most competitive companies in the next decade will not simply adopt AI; they will master the balance between automation speed and human discernment.

Key takeaways for leaders

AI performance still lacks reliability for complex workflows: Microsoft’s study found that frontier models lost around 25% of document content after 20 delegated interactions, and the average across all tested models was about 50%. Leaders should maintain human checkpoints and verification layers before scaling AI into business-critical processes.
Python is currently the most dependable domain for AI: Python was the one domain where most LLMs proved ready, and even the best model fell short in 41 of 52 domains. Executives should deploy AI first in structured, code-based domains where accuracy can be measured and proven before expanding its role in broader operations.
Document preservation remains AI’s biggest weakness: AI can generate content effectively but degrades quality and integrity across repeated revisions. Leaders should demand audit trails and integrity tracking within AI workflows to maintain trust and accountability in institutional records.
Autonomous AI must still operate under human oversight: Delegated AI remains prone to silent errors that undermine compliance and reliability. Decision-makers should design workflows that embed human validation at critical stages and define clear limits for autonomous operation.
Model fine‑tuning and verification are essential safeguards: Customized training and multi‑agent review can reduce the risk of degradation, but a poorly configured multi‑agent setup can make it worse. Executives should invest in domain‑specific fine‑tuning and deterministic verification tools to maintain accuracy at enterprise scale.
Expert oversight gains importance as AI evolves: Advanced AI requires equally advanced human reviewers to detect subtle distortions. Leaders should retain domain experts, retrain them for supervisory roles, and position human oversight as a strategic advantage in AI‑driven operations.