Generative AI models are highly unreliable and frequently wrong

When we talk about generative AI, most people picture advanced systems capable of responding instantly, writing code, summarizing documents, or even creating business strategies. That’s true, to an extent. But here’s the thing, these models are wrong a lot. Not just occasionally. A lot, as in 60% of the time.

Allie Mellen, Principal Analyst at Forrester, put it simply: “AI is wrong. It is wrong not just a little bit; it’s wrong a lot of the time.” At Forrester’s 2025 Security and Risk Summit, she laid out the truth behind the hype. Generative AI models, like ChatGPT and Gemini, aren’t smart in the human sense. They create answers that sound confident, but frequently lack accuracy or context. These systems generate outputs based on probability, not understanding. That’s a critical difference executives need to keep in mind.

Columbia University’s Tow Center for Digital Journalism tested eight major models. The result? 60% of their responses were incorrect. Let that number sink in. These tools are being adopted across industries, from customer service to threat detection, yet the majority of their answers are not reliable. What’s even more concerning is this: the models don’t fail quietly. They respond boldly, often misleading users into thinking the output is trustworthy.

If you’re building AI into your operations, you need to be direct about the risks. Don’t assume the model knows what it’s doing. Set up safeguards. Verify outputs. Use them as accelerators, not final decision-makers. That mindset keeps organizations agile without becoming fragile.

AI agents consistently fail at executing real-world corporate tasks

There’s excitement about AI agents stepping into roles across the enterprise, from automating customer email replies to managing infrastructure monitoring. It sounds great. But the current reality is more sobering: AI isn’t finishing the work it’s being assigned. In fact, it’s failing most of the time.

Jeff Pollard, Forrester’s VP and Principal Analyst, presented data that’s hard to ignore. He shared results from Carnegie Mellon’s AgentCompany benchmark. They tested top-tier models like Claude 3.5 Sonnet and GPT-4 against 175 practical business tasks. The best performers only got about 24% of them done without help. Add complexity, and the failure rate climbs fast, hitting 70 to 90%.

That’s not performance. That’s proof of immaturity in the technology. Even Salesforce’s own internal research, shared at Dreamforce 2024, showed CRM-focused AI agents failed to complete 62% of baseline tasks. When confidentiality controls were added to enforce data safety, failure rates went past 90%. These aren’t just bugs. They’re fundamental limits on capability right now.

Here’s the key takeaway for leadership: Do not over-automate based on faith in the model. That approach creates blind spots. Use AI to support your people, not replace them, because this technology, in its current form, lacks the precision and accountability required in enterprise environments.

If you’re integrating AI into mission-critical workflows, test it hard. Introduce fallback options. Track its accuracy against KPIs. Until reliability improves, leadership should treat AI agents as apprentices, not experts. That mindset keeps your reputation, quality, and customer trust intact.

AI-generated code presents serious security risks due to embedded vulnerabilities

AI is writing more code than ever. That might seem like progress. But there’s a critical issue, almost half of that code introduces security flaws. This isn’t speculation; it’s measured.

Veracode’s 2025 GenAI Code Security Report tested 80 coding tasks across more than 100 large language models (LLMs) in four programming languages: Java, Python, C, and JavaScript. What they found is direct and undeniable: 45% of the AI-generated code contained known OWASP Top 10 vulnerabilities. These include problems like SQL injection, log injection, and cross-site scripting. Even small mistakes in these areas can give attackers full access to databases, systems, or critical infrastructure.

The real concern here is not just about the frequency of security flaws, it’s about the disconnect between how well the AI can organize code and how securely it does so. More recent models can produce clean, compilable code, but their security performance hasn’t improved. This shows that larger training sets and refined syntactic patterns aren’t fixing the underlying risk. Vulnerabilities are still getting through at scale.

Language-specific results paint a clearer picture for tech leaders. Java had the lowest pass rate for secure code at 28.5%. Python, C, and JavaScript performed better but still left major gaps, especially with issues like cross-site scripting, where pass rates were as low as 12–13%. That means eight out of ten outputs from AI tools are unsafe in certain environments.

If teams are building software using LLMs, this needs executive attention. Incorporating AI into the dev lifecycle without enforced security checks is operationally risky. The solution is not to shut down AI usage, rather, it’s to layer in security validation tools, educate developers around the behavior of AI-generated code, and enforce reviews before production. Let the AI assist, but never guarantee safety without human oversight.

AI proliferation exacerbates identity sprawl and increases attack surfaces

As AI expands its footprint across businesses, it’s creating a new kind of risk, identity sprawl. This isn’t just about human logins anymore. AI agents, APIs, machine credentials, and ephemeral tokens now outnumber traditional users in many environments. And every new identity can become a way in for attackers.

Merritt Maxim, VP and Research Director at Forrester, put it clearly: “Identity security is undergoing the most significant shift since SSO went mainstream.” There’s a shift in how entitlements work, static permissions are being replaced by just-in-time access, often granted automatically and revoked minutes later. These dynamic entitlements increase flexibility, but also complexity. Security teams need to manage access in real time, with clear boundaries, or face breach-level consequences.

The August 2025 OAuth token breach is a prime example of this risk becoming real. Over 700 Salesforce customers were affected. What got exploited weren’t passwords or firewalls, it was OAuth tokens and API credentials. These are machine identities. If they aren’t governed like human credentials, they become blind spots that adversaries can exploit quickly and at scale.

AI adoption also brings in shadow systems, unapproved AI tools that employees integrate without going through security processes. A recent Veracode finding shows 88% of security leaders admit to having unauthorized AI in their organization. That means nearly every enterprise is currently operating with unknown, unmanaged risk vectors created by their own teams.

Managing this starts with visibility. Without knowing how many machine identities exist within your infrastructure, governance is impossible. Security leaders need to expand identity and access management (IAM) to explicitly cover machine and AI-generated identities. Forrester projects this gap will fuel the IAM market’s growth to $27.5 billion by 2029.

If governance doesn’t evolve at machine speed, attackers will move faster. Assign teams to audit AI-created credentials, rotate them automatically, and prioritize policy enforcement across human and non-human identities alike. That’s how organizations stay resilient in this phase of enterprise AI adoption.

AI must be governed as a distinct, high-risk identity class within organizations

We’re entering a phase where AI agents execute tasks, access data, and interact with core systems. That makes them more than just software, they’re operational entities, and they need to be governed like identities. Not as an extension of existing human or machine roles, but as a unique class with its own risks, behaviors, and privileges.

Andras Cser, VP and Principal Analyst at Forrester, made the point very clear: “AI agents sit somewhere between machines and human identities; high volume, high autonomy, high impact.” The systems we’ve used to manage user roles or service accounts weren’t built for this level of autonomous interaction. Traditional IAM tools don’t offer the monitoring granularity or the speed to adapt as AI agents spin up new tasks or processes dynamically.

The implication for executives is direct. The more these agents proliferate, the more potential exposure points are created, not necessarily from bad intentions, but from a lack of structured oversight. AI agents can initiate transactions, move data, and respond in real time. Without governance purpose-built for this entity type, organizations lose visibility over access rights, change history, or accountability triggers.

Legacy IAM architecture breaks down in this environment. These tools expect fixed roles, stable privileges, and linear workflows. AI doesn’t follow that. Governance needs precision and adaptability. Controls must be continuous, not static, monitored in real time, with the ability to revoke or limit scope based on context and system performance.

For decision-makers, the ask is clear: build governance around how these AI agents function. Treat them as first-class identities. Use dedicated platforms that provide real-time visibility, event-driven authorization, and full audit trails across interaction points. That’s how to reduce exposure, enforce compliance, and maintain accountability as AI continues to scale.

AI red teaming is essential to detect and address model-specific vulnerabilities

Until recently, red teaming focused on testing traditional infrastructure, network vulnerabilities, endpoint defenses, and configuration exposure. That focus is shifting. With AI agents increasingly embedded inside operational systems, the target isn’t always the network anymore. It’s the AI model itself.

Jeff Pollard of Forrester emphasized this transition: “Infrastructure flaws matter, but AI model flaws are what will break you.” He’s right. These systems introduce a completely new set of risks, prompt injection attacks, bias exploitation, model inversion, and failures that compound when multiple AI agents interact. Standard penetration tests won’t catch these.

This isn’t about theoretical risks. Carnegie Mellon’s AgentCompany benchmark showed top models failing at real-world tasks 70–90% of the time when complexity rose. These aren’t isolated bugs; they’re systemic performance breakdowns. Salesforce’s 2024 Dreamforce session echoed the same with CRM-focused AI agents failing more than 60% of baseline tasks, numbers that climbed even higher once protective constraints were applied.

AI red teaming is how companies adapt to this. It simulates adversarial behavior trained explicitly to expose model weaknesses, behaviors that exploit how the system makes assumptions, misinterprets logic, or hides false positives behind confident outputs. Unlike infrastructure testing, this demands a cross-disciplinary approach, security, model design, and domain expertise need to work together.

For executives making AI strategic, the recommendation is direct. If you’re deploying LLMs or agents into production environments, set up dedicated AI red teams. Equip them with tooling that can observe model interactions, inject adversarial paths, and monitor system response. This is how you get ahead of emergent weaknesses before they go live, or worse, before attackers find them.

Organizations must assume AI failure as a baseline and implement safeguards accordingly

Executives need to operate on a fundamental truth about AI: system failure isn’t the exception, it’s the expected behavior. Many of the models placed into production today, whether for content generation, code suggestions, or incident response, fail with a high degree of confidence. That makes them difficult to detect and easy to trust, for the wrong reasons.

Allie Mellen, Principal Analyst at Forrester, was clear on this at the 2025 Security and Risk Summit. AI-generated outputs today are producing false positives in critical environments, especially during investigations and response workflows. These aren’t passive errors. Models present them with certainty, triggering unwarranted actions or diverting resources away from legitimate incidents.

Columbia University’s research supports this. Their testing of eight major generative AI models, including ChatGPT and Gemini, showed a 60% failure rate. That’s not a small margin. These tools are failing more often than they succeed.

The leadership takeaway is simple: build systems assuming consistent inaccuracy. This doesn’t mean rejecting AI, far from it. It means deploying AI with full knowledge of its limitations and compensating for them with process safeguards, redundancy, and human verification wherever risk exists. Automated outputs shouldn’t bypass judgment. They should inform it.

AI accuracy will improve, but right now it’s not at a level where unsupervised decision-making is acceptable in high-impact domains. Set a failure budget internally that reflects expected error margins. Assign responsibility for reviewing AI-generated content and incorporate confidence scoring into operational pipelines. Overconfidence in AI is a risk. Planning for error is a strategy.

Over-reliance on automation and legacy “Trusted” infrastructures is a growing threat

Many organizations still rely on infrastructure built for a different era, one where systems operated at human speed and trust relationships were static. That foundation is breaking under the weight of autonomous, high-velocity AI systems. The assumption that automation equals improvement is no longer valid when trust is left unverified.

Jeff Pollard, VP and Principal Analyst at Forrester, delivered a strong warning: “Guardrails don’t make agents safe; they make them fail silently.” Carnegie Mellon’s benchmark efforts support this message. When guardrails and safety constraints were applied to advanced agents, failure rates increased, often exceeding 90%. That’s a performance collapse, not a containment measure.

The risk with legacy systems and blind trust in automation is their inability to handle compounding error. These systems don’t question the AI’s output, they execute based on historical configuration or static verification logic. That opens the door to exploitation. If an attacker compromises an AI agent’s process logic and legacy systems don’t challenge that logic, large-scale impacts can follow fast.

The necessary shift for C-level leaders is continuous verification. Every node in automation, from the triggering event to the final action, must be observable, audit-able, and interruptible. Don’t count on guardrails to correct course mid-flight. They weren’t designed to assess intelligent decision layers generated by LLMs.

Rebuild trust assumptions into active control systems. Don’t defer validation to once-a-year audits. Embed live monitoring and real-time risk scoring as part of AI and automation oversight. That level of diligence ensures the scale and speed of AI doesn’t outpace your ability to control it.

In conclusion

AI is moving fast, faster than most governance frameworks, security protocols, and legacy systems can handle. That’s a leadership issue, not a technical one. Models fail. Agents act without supervision. And automation, if left unchecked, becomes a silent liability. These aren’t edge cases; they’re becoming the norm.

What we’re dealing with now isn’t just faulty outputs or biased predictions. It’s systemic exposure, AI agents creating identities that aren’t tracked, shipping insecure code into production, triggering alerts based on hallucinations, and bypassing controls that were never designed to monitor them.

If you don’t build AI governance specifically for this shift, someone else’s model failure becomes your operational disaster. The assumption can no longer be that the tool works. The assumption needs to be that it won’t, and plan accordingly.

Hire AI red teams. Upgrade your IAM. Monitor machine identities like you would your top executives. And most importantly, stop assuming old systems will keep you safe in a new environment. The risk profile has already changed. It’s time the executive playbook did too.

Alexander Procter

January 21, 2026

12 Min