Open-weight AI models that perform well in isolated tests collapse under sustained, multi-turn adversarial attacks
There’s a blind spot in how most organizations are evaluating AI safety today, and it’s dangerous. AI models often get high marks when tested with one-off malicious queries. Makes sense. Most benchmarks are built on these single-turn attacks. But real-world adversaries don’t stop after one try. They keep pushing, probing, adapting.
Cisco’s AI Threat Research team showed this clearly. On average, open-weight models block 87% of single-turn attacks. But when an attacker continues the conversation across multiple turns, rewording, escalating, or reframing the same goal, that success rate falls apart. The average attack success rate jumps from 13% to 64%. For some models, like Mistral’s Large-2, it spikes to nearly 93%.
AI safety isn’t about handling the first bad prompt. It’s about handling all the ones that come after. Most CISOs and CTOs haven’t accounted for that. They’re still testing for isolated failures instead of probing for persistent weaknesses that unfold sentence by sentence.
Executives need to understand this: if your AI systems are passing only single-turn evaluations, they’re giving you a false sense of security. The real test is whether that system holds up across a true conversation. If it doesn’t, it won’t survive in production environments where failures have reputational and regulatory consequences. The danger isn’t just theoretical. It’s measurable.
As DJ Sampath, SVP at Cisco’s AI software platform group, said, “When you go from single-turn to multi-turn, all of a sudden these models are starting to display vulnerabilities where the attacks are succeeding, almost 80% in some cases.” That’s not a minor oversight, it’s a structural flaw. And it needs to be addressed now, not after deployment.
Multi-turn attack strategies leverage natural conversational dynamics to systematically circumvent AI safety mechanisms
The attackers aren’t using magic tricks. They’re just behaving like humans. That’s the fundamental problem. They break harmful requests into small chunks, stretch their goals over long conversations, or rephrase rejected requests until the AI gives in. The attack methods are familiar because they’re built on how people actually communicate: clarifying, building rapport, rewording, escalating. And for now, most models fall for it.
Cisco tested five multi-turn attack methods: breaking the message into parts (information decomposition), being deliberately vague (contextual ambiguity), slowly escalating towards harmful ends (crescendo), pretending to be someone else (role-play), and persistently reframing until success (refusal reframe). Every single one of those worked, reliably. Against a model like Mistral Large-2, these approaches had success rates over 89%, some as high as 95%.
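To make that concrete, here’s a minimal sketch of what a persistent probe looks like in practice. It implements only the refusal-reframe pattern, assumes a caller-supplied `ask_model` function for whatever chat endpoint you’re testing, and uses a naive keyword refusal check. None of this is Cisco’s actual harness; it’s an illustration of how little sophistication multi-turn pressure requires.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]  # {"role": "user" | "assistant", "content": "..."}

# Naive refusal check; real harnesses use a classifier, not keywords.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't")

def looks_like_refusal(reply: str) -> bool:
    lowered = reply.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def refusal_reframe_probe(
    ask_model: Callable[[List[Message]], str],
    goal_prompt: str,
    reframes: List[str],
) -> List[Message]:
    """Simulate a persistent attacker: same goal, reworded each turn.

    The property under test is whether refusals hold across the whole
    conversation, not just against the first prompt.
    """
    history: List[Message] = [{"role": "user", "content": goal_prompt}]
    reply = ask_model(history)
    history.append({"role": "assistant", "content": reply})

    for reframe in reframes:
        if not looks_like_refusal(reply):
            break  # the model complied; single-turn defenses already failed
        # The attacker does not start over: the earlier refusal stays in
        # context and the request comes back slightly reworded or re-justified.
        history.append({"role": "user", "content": reframe})
        reply = ask_model(history)
        history.append({"role": "assistant", "content": reply})

    return history
```

Running the same goals with history reset on every turn versus carried forward is exactly the single-turn versus multi-turn comparison the research describes.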
This isn’t about complexity. The attacks aren’t complex. The defense strategies are weak because current models aren’t built to maintain context over time. They’re optimized for sounding smart one message at a time, not for resisting consistent pressure over several exchanges. That’s the real issue.
For business leaders, here’s what matters: the threat isn’t exotic. It’s persistent. And if your AI tools can be bypassed just by someone behaving naturally for long enough, then your safeguards aren’t really safeguards. They’re just temporary delays.
Any model your team deploys needs to be evaluated not just for its IQ, but for its stamina. Can it hold its position when probed repeatedly in slightly different ways? Because that’s how modern adversaries operate. Any serious deployment needs multi-turn resilience baked in from day one. Not after the headlines hit.
The disparity in security gaps among AI models is closely tied to development philosophies
The difference in security effectiveness across AI models isn’t random, it comes down to who built them and what they prioritized. Some companies invest heavily in safety protocols during development. Others focus on pushing raw capability and flexibility, leaving security tuning for the customer to figure out post-deployment.
Cisco’s research makes this pattern clear. Models built by labs that emphasize alignment and responsible use, like Google’s Gemma-3-1B-IT, show minimal differences between single and multi-turn vulnerabilities. Gemma posted just a 10.53% gap between those threat profiles, which is what you’d expect when safety is structured and verified in development.
On the other hand, “capability-first” models show dramatic security drop-offs. Meta’s Llama 3.3-70B-Instruct had a 70.32% gap. Alibaba’s Qwen3-32B had the highest at 73.48%. Mistral’s Large-2, which openly lacks moderation mechanisms, posted a 70.81% gap. That’s not a coincidence. These models are being built fast and optimized for flexibility, fine-tuning, and performance. Security is left as optional.
From a leadership perspective, there’s nothing inherently wrong with choosing a high-capability model. But you need to go in with eyes open. If you pick those kinds of models, you’re picking responsibility for the security layer too. Waiting for someone else to fix the problem, post-launch or post-incident, doesn’t work. These gaps are design outcomes. The only question is whether your team is resourced and ready to fill them at runtime.
Open-weight AI models remain strategically valuable yet necessitate supplementary security measures
Open-weight AI models are rapidly becoming foundational tech across sectors. They’re customizable, fast to deploy, and avoid vendor lock-in. These are real operational advantages, particularly for companies moving fast in competitive spaces. Their openness is the reason enterprise adoption is accelerating.
But let’s be honest, openness also creates exposure. What you gain in flexibility, you give up in out-of-the-box protection. That doesn’t mean they aren’t worth using. It means don’t deploy them blind.
Cisco isn’t just pointing fingers here. They’ve released open-weight models themselves, like Foundation-Sec-8B, through platforms such as Hugging Face. DJ Sampath, SVP at Cisco, has been clear: “Open source has its own set of drawbacks. When you start to pull a model that is open weight, you have to think through what the security implications are and make sure that you’re constantly putting the right types of guardrails around the model.”
Executives should treat open-weight models as powerful tools that require conscious risk management. These tools will get you to market faster and allow adaptation. But because they don’t include robust, built-in defenses, your security team has to compensate for that with runtime protections, real-time monitoring, and hardened deployment strategies.
If your team is counting solely on filters already built into these open models, you’re in trouble. Protection isn’t baked in. But engineered properly, open-weight models can still be deployed safely. The key is taking full ownership of their security architecture, up front.
A limited set of subthreat categories accounts for the majority of vulnerabilities in open-weight AI models
Not all threats are created equal. In Cisco’s research, just 15 subcategories were responsible for most successful attacks across all tested models. That’s actionable. If you’re running AI in production, these are your high-priority targets for mitigation.
The top vulnerabilities include malicious infrastructure operations (38.8% average success rate), gold trafficking (33.8%), network attack operations (32.5%), and investment fraud (31.2%). These weren’t random weak spots; the same categories produced consistently high success rates across the tested models. That makes them the logical starting point for defensive tuning, preemptive filtering, and policy enforcement.
In practical terms, this means enterprises don’t need to solve for every possible misuse case from day one. You can instead take a focused approach: identify what the model is most likely to fail on, and deploy targeted safeguards in those areas. This delivers immediate impact on your risk profile without requiring full security coverage upfront.
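As an illustration, a targeted policy layer can be as simple as a deny-list over classifier output. The sketch below is hypothetical: the category labels are paraphrased from the list above, and `detected_categories` is assumed to come from whatever intent classifier or guardrail service you already run over the conversation.

```python
from dataclasses import dataclass
from typing import Set

# Category labels paraphrased from the high-risk subcategories above; extend
# this set as your own red-teaming surfaces new weak spots.
HIGH_RISK_CATEGORIES: Set[str] = {
    "malicious_infrastructure_operations",
    "network_attack_operations",
    "investment_fraud",
}

@dataclass
class PolicyDecision:
    allow: bool
    reason: str

def apply_targeted_policy(detected_categories: Set[str]) -> PolicyDecision:
    """Block the known weak spots first instead of trying to cover everything.

    `detected_categories` is assumed to come from an upstream intent
    classifier or guardrail service run over the full conversation,
    not just the latest message.
    """
    hits = detected_categories & HIGH_RISK_CATEGORIES
    if hits:
        return PolicyDecision(
            allow=False,
            reason=f"high-risk categories detected: {sorted(hits)}",
        )
    return PolicyDecision(allow=True, reason="no high-risk category detected")
```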
For leadership, this focuses the conversation. It’s not about broad hypotheticals. It’s about known weak spots. Prioritize them sharply and apply pressure there, where it counts most. Disproportionate gains in safety come from concentrated effort in high-risk zones. That’s where you start to close the reality gap between benchmark compliance and production-grade resilience.
Robust security against multi-turn attacks demands a multifaceted defense strategy
Most AI models on the market today weren’t built to defend themselves across extended interactions. That’s not a bug, it’s a development trade-off. Which means responsibility for security falls on you and your team. The good news: there’s a clear set of tactics you can apply right now to harden your systems.
Cisco’s research outlines six top-level defenses to prioritize:
- Context-aware guardrails, so the model tracks state and meaning across conversation turns.
- Model-agnostic runtime protections, external layers that block harmful content regardless of model architecture.
- Continuous red-teaming, to simulate multi-turn adversarial behavior and uncover real weaknesses ahead of attackers.
- Hardened system prompts, to resist instruction overrides during longer sessions.
- Full forensic logging, for complete incident tracking and auditing.
- Threat-specific mitigations, targeting the most vulnerable subcategories, as surfaced by recent data.
These aren’t optional checkboxes. They’re foundational components for any production AI system operating at scale. And they need to operate in concert, not as isolated add-ons, but as part of an integrated posture.
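As a minimal sketch of how the first two items, plus forensic logging, fit together, the wrapper below assumes you supply both the model call and an external moderation check. The function names and log format are illustrative, not any specific product’s API.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Callable, Dict, List

Message = Dict[str, str]  # {"role": ..., "content": ...}

audit_log = logging.getLogger("ai.guardrail.audit")

def guarded_generate(
    generate: Callable[[List[Message]], str],
    moderate: Callable[[List[Message]], bool],
    conversation: List[Message],
) -> str:
    """Run any model behind the same external checks and forensic logging.

    `generate` is whatever model you deploy; `moderate` is an external
    classifier or guardrail service that scores the whole conversation,
    so escalation across turns stays visible instead of being judged one
    message at a time.
    """
    stamp = datetime.now(timezone.utc).isoformat()

    # Check the full history before the model ever sees the latest turn.
    if not moderate(conversation):
        audit_log.warning(json.dumps(
            {"ts": stamp, "event": "blocked_input", "turns": len(conversation)}))
        return "This request can't be completed."

    reply = generate(conversation)

    # Check again with the candidate reply appended, then log the outcome
    # either way so incidents can be reconstructed after the fact.
    if not moderate(conversation + [{"role": "assistant", "content": reply}]):
        audit_log.warning(json.dumps(
            {"ts": stamp, "event": "blocked_output", "turns": len(conversation)}))
        return "This request can't be completed."

    audit_log.info(json.dumps(
        {"ts": stamp, "event": "allowed", "turns": len(conversation)}))
    return reply
```

The design point is that the moderation check sees the entire conversation on every turn, so a slow escalation gets judged in context rather than message by message.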
Leadership needs to stop treating AI security as an afterthought or secondary responsibility. If you want to deploy AI widely across your organization, to drive productivity, improve operations, reduce latency, then you need to secure that usage from the inside out. DJ Sampath said it clearly: “If we have the ability to see prompt injection attacks and block them, I can then unlock and unleash AI adoption in a fundamentally different fashion.”
Security isn’t a bottleneck. It’s what makes scaled adoption possible.
Enterprises must shift from reactive measures to proactive, real-time defense strategies to safeguard AI deployments
Many organizations are still in wait-and-see mode with AI. That’s a mistake. The threat landscape is not slowing down, it’s evolving every few weeks. If you’re holding off until there’s a “final version” of AI or some standard baseline to compare against, you’re misreading the pace and nature of this space.
Adversarial techniques aren’t static. Attackers adapt quickly, and models that appear secure under current benchmarks can become exposed overnight. Cisco’s research showed how rapidly previously unknown patterns, like multi-turn persistence, can overwhelm model safeguards. If your enterprise AI strategy doesn’t include real-time adaptation and continuous testing, you’re not ready.
Waiting introduces exposure. Every unmonitored conversation, every untested workload, is a point of risk. You need security validation that is ongoing, not a one-time certification. You need teams that simulate persistent attacks internally before they’re used in the wild. The right mindset is to assume your AI will fail under pressure until proven otherwise.
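One way to make that ongoing validation concrete is to gate deployments on a measured multi-turn attack success rate. The sketch below is illustrative: `transcripts` would come from probes like the refusal-reframe example earlier, `goal_achieved` is whatever judge you trust, and the threshold is a placeholder you’d set yourself.

```python
from typing import Callable, Dict, Iterable, List

Message = Dict[str, str]

# Hypothetical threshold: the multi-turn attack success rate you are willing
# to tolerate before a deployment is blocked or rolled back.
MAX_ACCEPTABLE_RATE = 0.05

def release_gate(
    transcripts: Iterable[List[Message]],
    goal_achieved: Callable[[List[Message]], bool],
) -> bool:
    """Scheduled check, not a one-time certification.

    `transcripts` are multi-turn probe conversations; `goal_achieved` decides
    whether the attacker eventually got what they wanted in a given transcript.
    """
    results = [goal_achieved(t) for t in transcripts]
    rate = sum(results) / max(len(results), 1)
    return rate <= MAX_ACCEPTABLE_RATE
```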
DJ Sampath put it plainly: “A lot of folks are in this holding pattern, waiting for AI to settle down. That is the wrong way to think about this. Every couple of weeks, something dramatic happens that resets that frame. Pick a partner and start doubling down.”
If you want to scale AI in your business, across operations, customer experience, or internal productivity, it must be secured at the system level. That effort can’t be deferred. The longer an AI model runs in production without full-spectrum defenses, the greater the surface area for adversarial learning. Enterprises that delay security are giving attackers a head start.
Now is the right time to move. Not when the model is perfect. Not when consensus arrives. Right now, while you still control the pace.
Recap
If you’re in charge of deploying AI inside your company, assume this: your models are more vulnerable than they appear. Single-prompt results look clean on paper, but attackers don’t operate in isolation. They persist, escalate, and adapt. Your defenses need to do the same.
The real gap isn’t just technical. It’s strategic. Most benchmarks tell you if a model works once, not whether it holds up over time. That’s a serious blind spot, especially for enterprises scaling AI across teams, customers, and systems.
Security isn’t a blocker. It’s the enabler. Without guardrails that protect across full conversations, you’re running unstable systems at production scale. That’s a bad tradeoff both for safety and for business continuity. But the good news is you’re not stuck. The fixes are known, the patterns are clear, and the tools to harden your stack are already in the market.
Strong adoption starts with strong defense. Build for persistence, not just performance. And stop assuming your benchmarks mean you’re protected. In reality, one blocked prompt doesn’t matter if ten others get through.


