Generative AI can reduce developer productivity among experienced programmers

We’re seeing something interesting happen. Generative AI tools, built to help developers move faster, are in some cases slowing them down. That means we need to look at how they really perform in a work context, not how they perform on marketing slides.

A study by METR, a nonprofit research group, took 16 seasoned open-source developers, each with over a decade of experience, and had them work with familiar genAI tooling: Cursor Pro paired with Anthropic’s Claude (versions 3.5 and 3.7 Sonnet). The setup was straightforward: real programmers, fixing bugs and building features in projects they already knew. The kind of work developers do every single day.

These developers expected genAI to shave 20% or more off their task times. That’s not what happened. On average, they took 19% longer to finish the work than on the tasks they completed without AI tools. Put concretely: on a task that would take ten hours unassisted, developers expected to finish in about eight hours with AI; instead, it took closer to twelve. That’s real-world friction: more time burned on figuring out prompts, scanning auto-generated code, and fixing edge cases, including security vulnerabilities.

This is exactly what Domenic Denicola from Google Chrome noted. He worked directly with these genAI tools and said he was surprised “how bad the models are at implementing web specifications.” That’s not a minor shortfall; it points to a gap between what the model produces and what a live, secure system truly needs.

For executive teams, the message is clear. If your developers are highly skilled, integrating genAI into their workflow may feel more like babysitting an intern than working with a co-pilot. Before you roll out AI-enhanced code support across your team, check whether it’s actually increasing throughput, or just creating more work.

Coding benchmarks exaggerate productivity gains by ignoring real-world use cases

A lot of AI companies pull their headline productivity numbers from coding benchmarks that were never built to reflect real-world software projects. These tests often favor volume and speed over accuracy and nuance. Sure, models can rip through isolated script files in milliseconds. But when they’re asked to solve messy, interconnected issues in real code environments, the whole picture changes.

The truth is, these benchmarks are more about scale than realism. They’re designed to show off what the model can do under perfect lab conditions, not how it behaves when pointed at enterprise systems with legacy code, edge-case bugs, and actual engineering standards. That distinction matters. If you’re investing based on “coding tasks completed per minute,” you’re missing what developers actually do: thinking, problem-solving, and fixing other people’s mistakes.

The METR study was designed differently. It wasn’t based on artificial challenges. It measured how experienced programmers actually interacted with AI on familiar projects. It tracked the real effort required to get useful results, validate output, and deploy code.

C-suite leaders should approach benchmark results with healthy skepticism. AI will absolutely reshape how we work and build, but don’t base internal ROI expectations on numbers that don’t reflect your real work environment. Ask your teams what they see when they put these things into action. If benchmarks impress and real performance lags, you’ve got your answer.

GenAI tools often produce low-quality code that requires extensive revision

There’s this idea that generative AI gives you code fast and gets you closer to done. That’s partially true, but what gets overlooked is what it takes to make that code actually work.

Developers using genAI often find themselves cleaning up a lot of the mess. The AI can produce something that looks good, maybe even runs, but beneath that surface is usually a mix of redundant logic, poor architectural decisions, and bugs. Fixing that costs time. And in quality-critical systems, products with customer exposure or security implications, cutting corners isn’t an option.

Developers on platforms like Reddit consistently share the same pattern: genAI delivers maybe 80% of the work quickly, but the remaining 20%, the part that makes the code viable, is where the time really goes. Removing duplicated structures, rewriting flawed logic, and aligning the code with actual system requirements takes longer than if the task had just been approached from scratch by an experienced coder.
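To make that concrete, here’s a minimal, entirely hypothetical sketch of the pattern those developers describe. The first function mimics the kind of plausible-looking but redundant output a genAI tool might produce; the second shows the post-generation cleanup. The names and logic are invented for illustration, not taken from the study.

```python
# Hypothetical AI-style output: it runs and looks reasonable,
# but repeats the same arithmetic in every branch and hides
# an edge case.
def calculate_discount(price, customer_type):
    if customer_type == "gold":
        discount = price * 0.20
        final_price = price - discount
        return final_price
    elif customer_type == "silver":
        discount = price * 0.10
        final_price = price - discount
        return final_price
    elif customer_type == "bronze":
        discount = price * 0.05
        final_price = price - discount
        return final_price
    # Unknown customer types silently fall through and return None,
    # the kind of latent bug that surfaces later in production.

# The cleanup an experienced developer ends up doing: one code
# path, explicit handling of unknown inputs, duplication removed.
DISCOUNT_RATES = {"gold": 0.20, "silver": 0.10, "bronze": 0.05}

def calculate_discount_clean(price: float, customer_type: str) -> float:
    rate = DISCOUNT_RATES.get(customer_type, 0.0)  # default: no discount
    return price * (1 - rate)
```

That second pass, spotting the silent None return, collapsing the duplicated branches, deciding how unknown inputs should behave, is exactly where the unglamorous 20% of the time goes.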

So what’s the impact here for executive leaders? It’s important to recognize that just generating code doesn’t mean you’re generating value. The engineering effort required post-generation might neutralize any speed advantage. If you’re building for production environments or anything customer-facing, you want engineering output that’s robust.

Don’t mistake generative efficiency for total efficiency. If your teams are shipping slower, or not trusting the AI output enough to ship at all, you’re not seeing productivity.

Inexperienced developers relying on genAI risk losing core technical skills

There’s a deeper risk emerging with genAI. For those still building foundational skills (early-stage developers, junior engineers), it’s changing how they learn and what they retain. GenAI gives them shortcut access to functional code, but without the context and feedback developers usually rely on to actually grow.

The issue isn’t producing something that runs. It’s understanding how and why things work. That skill gap matters. When AI-generated code fails, and it does, those who lean too heavily on it often don’t know how to identify the source of the breakdown, let alone how to fix it. Without a core foundation in debugging and system design, the developer effectively stalls.
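For a concrete sense of what that breakdown looks like, consider a hypothetical sketch (invented for illustration): AI-generated code that passes a quick test but fails in a way that is hard to trace without debugging fundamentals.

```python
# Hypothetical AI-generated helper: works in a quick demo,
# fails subtly afterwards.
def add_tag(tag, tags=[]):      # bug: mutable default argument
    tags.append(tag)
    return tags

first = add_tag("urgent")       # ["urgent"] -- looks correct
second = add_tag("archived")    # ["urgent", "archived"] -- state has
                                # leaked in from the first call

# Diagnosing this requires knowing that Python evaluates default
# arguments once, at definition time. The fix is straightforward
# once you understand the mechanism:
def add_tag_fixed(tag, tags=None):
    if tags is None:
        tags = []
    tags.append(tag)
    return tags
```

A developer who has only ever prompted for code has no reason to have learned that mechanism, and that’s exactly where the stall happens.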

Kaustubh Saini, a technical writer observing this trend, put it well. He said genAI is producing a generation of coders who can generate but not understand. That’s a big problem. They struggle to debug, maintain, or improve their own code because the core logic isn’t theirs.

As an executive, you want your teams to move fast, and grow fast. But when junior talent leans too far into genAI without developing real engineering intuition, they plateau early. That affects long-term scalability, team reliability, and problem-solving capacity. GenAI should assist developers, not replace the learning process.

This is a good time to rethink how your teams use AI. It’s not about limiting access; it’s about setting clear expectations. Developers need to understand the systems they’re building. If that knowledge isn’t growing, then genAI becomes a ceiling, not a springboard.

Productivity pitfalls seen in software development with genAI are also emerging in other professional fields

What we’re seeing in software isn’t limited to software. GenAI is facing similar performance issues across multiple industries, especially where output is easy to measure, as in writing, communications, or design. The pattern is consistent: speed goes up, but quality often goes down.

Take writing, for example. Articles, reports, and marketing content generated with genAI platforms may seem competent at first glance. But look closely and problems emerge: factual errors, poor structure, unclear messaging. It’s production without precision. In many settings, companies are pushing out content that’s passable, but not strong enough to deliver strategic outcomes. And that’s assuming no one checks it too carefully.

The same situation is likely playing out across service-based and creative roles. Organizations are using AI to hit output goals, but they’re not always getting output they can trust. And when trust isn’t there, teams spend more time fact-checking, cleaning up, and revising, pulling experienced professionals into low-leverage work just to recover quality.

For executives, this isn’t just a software development issue. It’s a signal. If genAI is being positioned as a workforce multiplier across your business, you should look closely at what it’s actually producing. Higher output doesn’t automatically mean higher performance. Work that depends on quality control (client communications, public content, investor reports) still needs people who know how to assess and correct what AI gets wrong.

Use genAI where it creates leverage. But don’t replace your experts with tools that aren’t producing expert-grade work.

High-level endorsements of genAI’s productivity potential contrast sharply with its performance in practical tasks

You’ve likely heard the big statements from top industry names, claims that genAI will drive global GDP, boost productivity, and reshape the job market. Some of that’s absolutely true. But the excitement at the top often doesn’t line up with the mechanics at the team level.

Jensen Huang, CEO of Nvidia, said that AI will augment every job. Satya Nadella, CEO of Microsoft, noted that Copilot contributes to over 30% of Microsoft’s codebase and is “transforming the developer experience.” Doug Matty, Chief Digital and AI Officer at the U.S. Department of Defense, described AI platforms like Grok as key to maintaining strategic advantage. These are strong statements from people leading organizations where AI is foundational.

But here’s the thing: real-world users aren’t experiencing that transformation at the same velocity. According to the 2025 METR study, seasoned developers using genAI were 19% slower at task completion than when working without the tools. The 2024 DORA study reached similar conclusions: while genAI did help teams speed up code review, the quality of AI-generated code was often too weak to ship without major rework.

This is a reality check. GenAI is not yet ready to consistently deliver end-to-end performance in real-world operations. It brings a productivity promise, but in many cases the actual gains are uneven once you account for debugging, validation, and cleanup. The claims from the top may reflect vision more than current operating reality.

As an executive leader, it’s your job to separate hype from usable value. You should absolutely explore and invest in genAI. But if your teams are telling you it’s not delivering above baseline today, that’s not resistance, it’s insight. Instead of pushing for immediate ROI, build the supporting infrastructure, define where human oversight is required, and treat productivity claims as inputs, not guarantees.

Key executive takeaways

  • Generative AI slows expert developers: Despite expectations, seasoned devs using genAI tools were 19% slower due to prompt crafting, code validation, and debugging. Leaders should evaluate team expertise before rolling out AI at scale.
  • Benchmarks don’t reflect real work: AI performance metrics are based on artificial benchmarks that don’t match real-world dev tasks. Execs should prioritize hands-on team feedback over vendor claims to assess actual productivity gains.
  • AI-generated code needs cleanup: GenAI outputs often require extensive revisions for proper structure, security, and maintainability. Leaders must allocate time for post-generation review when planning AI-integrated workflows.
  • Junior developers risk skill erosion: Over-reliance on genAI is reducing foundational skills among less experienced devs who can’t troubleshoot or adapt code. Investment in developer training is critical to prevent long-term capability gaps.
  • Quality drops in other domains too: AI introduces similar inefficiencies in writing, content, and knowledge work, often producing low-accuracy outputs. Cross-functional teams should implement QA steps before trusting AI-generated content.
  • Executive vision outpaces operational results: Public endorsements from AI leaders contrast with underwhelming results at the team level. Leaders should temper strategic AI adoption with workflow-level checks to avoid scaling inefficiencies.

Alexander Procter

September 15, 2025
