AI coding tools can reduce productivity for experienced developers

There’s a strong belief across the tech industry that AI development tools make experienced developers faster. That belief doesn’t hold up to the reality we’re seeing in actual production environments. Recent findings from Model Evaluation & Threat Research (METR) show that experienced developers using AI tools like Cursor Pro and Claude were 19% slower in completing real-world coding tasks compared to doing the work manually.

This wasn’t a casual observation. It came from a well-run, randomized controlled trial involving 16 seasoned developers who worked on live, complex codebases they were already deeply familiar with. The lesson is simple: AI suggestions don’t always align well with how these experts already work. In mature, large-scale software systems, context and quality matter more than the raw volume of suggestions. When the AI doesn’t truly understand the broader architecture at play, it becomes an interruption, not a boost.

Now, this doesn’t mean AI tools have no value. What it means is that productivity lifts from these tools are not universal. Leaders need to understand the environments they’re deploying AI in. If it’s a mature codebase maintained by experts who already run highly optimized workflows, layering AI into that mix can introduce extra steps instead of eliminating them.

For enterprises investing in AI for development at scale, efficiency isn’t just about writing code faster. It’s about everything that comes after: the review, the churn, and the integrations. If we care about actual productivity, we need to count how much work AI creates downstream, not just what it outputs.

A major perception gap exists between expected and actual AI performance

Before the METR study started, developers expected AI tools to shave 24% off their task times. Even after completing the study, and despite objectively performing 19% slower, they still said AI had made them 20% more productive. That kind of disconnect isn’t just dangerous; it distorts how we measure success.

This overconfidence wasn’t limited to developers. According to the study, economists predicted a 39% improvement in productivity from AI, and machine learning specialists pegged it slightly lower at 38%. The numbers look good on paper, but real-world conditions matter more. The perception of speed or enhanced productivity seems to be more about how AI feels to use than about what it actually delivers. The tools reduce mental effort, sure. But that doesn’t translate directly to getting more done.

For leaders running tech operations or overseeing transformation projects, this is a warning shot. Don’t confuse user satisfaction with actual outcomes. A developer who says they “feel more productive” may not be shipping faster or better code. You can’t let internal narratives about productivity run unchecked when the data says something else.

The path forward starts with measuring performance, real performance, not sentiment. Don’t go by how people feel about the tools. Go by outcomes that scale across teams and connect directly to business metrics.

Developers frequently modify AI-generated code, limiting its efficiency gains

When experienced developers use AI coding tools, they rarely accept the output at face value. In the METR study, developers accepted fewer than 44% of the code suggestions provided by AI. Beyond that, 75% reviewed every single line, and 56% reported making significant changes to the proposals. That tells you AI isn’t automating the tough parts of software development; it’s merely participating.

This is where assumptions about automation break. Writing code isn’t just typing characters. Developers are maintaining style conventions, enforcing security practices, and working within highly customized architectures. AI tools don’t understand that context deeply enough, so human review becomes necessary. The code might compile, but that doesn’t mean it fits.

For enterprises pouring resources into AI adoption, this is important. If your teams are still cleaning up AI code, modifying structures, and running extra reviews, you’re increasing task load, not reducing it. Even simple mistakes by AI can introduce more work later. Large systems can’t afford frequent fixes caused by context-insensitive or inconsistent code suggestions.

The right way to deploy AI in these scenarios is targeted. Use it where suggestion precision matters less (documentation, test scaffolding, boilerplate, as in the sketch below), but keep it limited in areas requiring domain depth. Time spent correcting AI’s near misses could be better spent doing the work correctly from the start.
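To make “low-stakes boilerplate” concrete, here is a minimal, illustrative sketch of the kind of test scaffolding that is relatively safe to delegate to an AI assistant. The module and function names (myapp.config, parse_config) and the expected defaults are hypothetical; the point is that this structure is conventional and quick to review, unlike architecture-level changes.

```python
# Illustrative only: a pytest scaffold for a hypothetical parse_config() helper.
# Boilerplate like this is low-risk to delegate to an AI assistant because the
# structure is conventional and easy for a reviewer to verify at a glance.
import pytest

from myapp.config import parse_config  # hypothetical module under test


def test_parse_config_returns_defaults_for_empty_input():
    config = parse_config({})
    assert config["timeout_seconds"] == 30  # assumed default value


def test_parse_config_rejects_unknown_keys():
    with pytest.raises(ValueError):
        parse_config({"not_a_real_key": True})
```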

AI struggles with integration in complex, mature codebases

AI tools show real limitations when operating within large, well-established codebases. These codebases come with deep interdependencies, strict standards, and evolved architecture patterns. The METR research confirmed that developers experienced the most slowdown when working on projects they were already highly familiar with, indicating that AI disrupted rather than supported their flow in these environments.

The core issue is contextual understanding. AI quickly generates suggestions, but large codebases demand precision tailored to history, architecture constraints, and team preferences. When AI doesn’t offer that precision, experienced developers step in to interpret, adjust, and fix. The result is slower task completion and increased mental switching costs.

This has strategic implications. For CIOs and CTOs looking to integrate AI tools into enterprise systems, the value depends on where and how these tools are applied. Dropping AI into legacy systems or mission-critical environments without structured evaluation will likely backfire, through delays, bugs, or wasted engineering time.

You don’t need to reject AI entirely. Just be deliberate. Identify areas where the stakes are lower, where architectural complexity is minimal, and where AI can help without introducing cleanup. That’s how you avoid regression masked as innovation.

Rigorous, controlled testing enhances the credibility of AI productivity findings

Most AI productivity claims are built around limited or idealized scenarios. What separates the METR study is its use of a randomized controlled trial under real-world conditions. This wasn’t a lab environment. Developers were assigned real coding work on repositories they had been contributing to for years, averaging over one million lines of code and more than 22,000 GitHub stars. The tasks weren’t abstract exercises. They were actual pieces of production work.

This distinction matters. Most other studies that show optimistic numbers around AI-assisted development rely on isolated, simple tasks. They don’t factor in organizational scale, codebase complexity, or developer familiarity. METR’s approach does. Developers used Cursor Pro in combination with Claude 3.5 and 3.7 Sonnet between February and June 2025, with every session screen-recorded to capture their full workflow. Tasks averaged two hours each, providing actual engagement data, not speculation.

For executives evaluating AI initiatives in software delivery, this kind of testing sets the bar. It tells you the conditions under which AI adds or subtracts value. Vendor benchmarks and developer self-assessments don’t give you operational clarity. Controlled experiments with realistic constraints do.

If your goal is to optimize engineering throughput, you can’t rely on surface-level metrics. Start demanding trials that are contextual, consistent, and connected to business-driven outcomes. Anything else won’t hold up in live systems.

Broader industry evidence underscores trust and performance concerns with AI tools

Beyond METR, larger-scale industry data raises more questions than confirmations when it comes to AI productivity in software development. Google’s 2024 DORA report, which surveyed over 39,000 tech professionals, shows a clear disconnect. While 75% of respondents said they felt more productive using AI tools, the underlying system performance metrics went down.

The report links a 25% increase in AI adoption to a 1.5% drop in delivery speed and, more worrying for anyone maintaining large-scale systems, a 7.2% drop in delivery stability. These aren’t small deviations. They signal a measurable degradation in the quality of delivery as AI involvement rises. Additionally, 39% of developers reported low or no trust in the code produced by AI. That hesitance slows down review and approval pipelines, further impacting cycle time.

The narrative that AI makes teams faster and smarter is appealing, but incomplete. Perception-based improvements in productivity don’t always correlate with better system outcomes. If trust is missing, velocity slows. If generated code introduces bugs or inefficiencies, you accumulate hidden technical debt.

This highlights the need to integrate AI under measurable, enforceable standards. Don’t assume that satisfaction equals success. Watch the metrics: delivery, system reliability, and review lag. AI must win on those terms or it isn’t helping your business.
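One way to watch those metrics is to compare AI-assisted and manual changes side by side. The sketch below is a minimal example under assumed conditions: it presumes you can export per-change records with fields like merged_at, first_review_at, caused_incident, and an ai_assisted tag. The field names and data source are hypothetical; in practice the data would come from your own delivery pipeline or DORA-style tooling.

```python
# Minimal sketch: compare delivery lead time, change failure rate, and review lag
# between AI-assisted and manual changes. Record fields are hypothetical.
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class Change:
    opened_at: datetime
    first_review_at: datetime
    merged_at: datetime
    caused_incident: bool   # e.g. linked to a rollback or production incident
    ai_assisted: bool       # e.g. the author tagged the change as AI-assisted


def summarize(changes: list[Change]) -> dict:
    """Return median lead time, median review lag (hours), and change failure rate."""
    lead_times = [(c.merged_at - c.opened_at).total_seconds() / 3600 for c in changes]
    review_lags = [(c.first_review_at - c.opened_at).total_seconds() / 3600 for c in changes]
    failure_rate = sum(c.caused_incident for c in changes) / len(changes)
    return {
        "median_lead_time_h": median(lead_times),
        "median_review_lag_h": median(review_lags),
        "change_failure_rate": failure_rate,
        "n": len(changes),
    }


def compare(changes: list[Change]) -> None:
    """Print the same summary for the AI-assisted and manual cohorts."""
    for label, group in (
        ("AI-assisted", [c for c in changes if c.ai_assisted]),
        ("manual", [c for c in changes if not c.ai_assisted]),
    ):
        if group:
            print(label, summarize(group))
```

The design choice worth noting is the cohort split: perception surveys average over everything, while a per-cohort view makes it visible whether AI-assisted work is actually slower to review or more likely to fail.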

Despite slower productivity, developers continue to value AI tools

Even after experiencing slower task completion (19% slower on average), developers continued using the AI tools. Specifically, 69% of study participants chose to keep using Cursor Pro after the METR study concluded. This signals something important: developers see value in AI tools that goes beyond sheer speed.

The tools reduce cognitive load. They help with small but frequent tasks like code formatting, documentation, or test scaffolding. These low-effort use cases give developers more mental bandwidth, even if they don’t translate directly into faster execution. In practice, this results in a perception of improved workflow, even when the data shows otherwise.

For business leaders, this highlights a subtle but critical insight. Developer adoption of AI may be driven by qualitative benefits that don’t immediately register on output dashboards. Satisfaction, ease of mental processing, and sense of flow all matter, but they don’t necessarily equate to measurable efficiency. If these tools improve morale or reduce fatigue without hurting quality, they still add value. But if you’re focused on output, those assumptions need rigorous testing.

You can’t ignore that the majority of developers voted with their actions and kept using the tools. But decision-makers should ask: where does it actually help, and where is the benefit just soft perception? That distinction affects how AI tools should be procured, evaluated, and scaled across teams.

Enterprises must adopt a nuanced, contextual approach when integrating AI

Treating AI as a one-size-fits-all solution in software development doesn’t align with how teams actually work. Sanchit Vir Gogia, CEO and Chief Analyst at Greyhound Research, made this clear: automation needs to be specific and strategic. AI copilots are useful, but only in areas where they amplify cognitive capacity without degrading productivity or code quality.

Documentation, boilerplate code, and test generation are clear areas where AI can perform reliably. But in segments of the workflow that depend heavily on context, system architecture, or domain knowledge, AI risks adding friction. Gogia urged companies to elevate the rigor of their evaluation frameworks. That means going beyond internal satisfaction scores or vendor case studies to real, quantifiable productivity metrics tied to delivery timelines, peer review cycles, and rework.

CIOs, CTOs, and engineering leaders should approach this with portfolio thinking. Use AI where it adds utility. Hold back where it doesn’t. Make sure your deployment strategy includes mechanisms to track whether AI tools are creating rework or reducing it. Accurate metrics matter more than broad adoption.
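As a rough illustration of such a mechanism, the sketch below estimates a rework rate for AI-assisted changes. The definition used here (a change that was reverted or needed a follow-up fix within 14 days of merging) is an assumption for the example, not a standard metric, and the record fields are hypothetical.

```python
# Minimal sketch: estimate a rework rate for AI-assisted vs manual changes.
# "Rework" is defined here, as an assumption, as a revert or follow-up fix
# landing within a fixed window after the original change merged.
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class MergedChange:
    change_id: str
    merged_at: datetime
    ai_assisted: bool
    reworked_at: datetime | None  # revert or follow-up fix referencing this change


def rework_rate(changes: list[MergedChange], window_days: int = 14) -> float:
    """Share of changes that required rework within the given window."""
    window = timedelta(days=window_days)
    reworked = sum(
        1 for c in changes
        if c.reworked_at is not None and c.reworked_at - c.merged_at <= window
    )
    return reworked / len(changes) if changes else 0.0


def rework_by_cohort(changes: list[MergedChange]) -> dict[str, float]:
    """Compare rework rates for AI-assisted and manual changes."""
    return {
        "ai_assisted": rework_rate([c for c in changes if c.ai_assisted]),
        "manual": rework_rate([c for c in changes if not c.ai_assisted]),
    }
```

If the AI-assisted cohort shows a consistently higher rework rate, the tool is shifting effort downstream rather than removing it, which is exactly the signal adoption counts and satisfaction surveys miss.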

Governance is key. Putting AI tools in critical paths without policy, structure, or oversight introduces hidden risk. If you’re scaling automation, you need a measurement layer that goes beyond usage frequency or perception, and actually shows where the performance curve is going.

In conclusion

AI in coding is a tool. And like any tool, where and how you use it determines the value you get. For experienced developers working in complex, mature systems, AI can introduce more friction than speed. That doesn’t mean you throw it out. It means you stay deliberate.

Satisfaction, adoption, and perceived productivity are easy to measure. What matters more is what happens downstream: system stability, code review times, rework, delivery velocity. If you’re not tracking those, you don’t have the full picture.

The smart approach is selective integration. Deploy AI copilots in the right parts of your stack (documentation, boilerplate, repetitive QA work), but avoid dropping them blindly into mission-critical workflows. Build governance, not just excitement.

Alexander Procter

September 26, 2025