LLM agents underperform in handling complex CRM tasks requiring multiple steps

Right now, large language models (think GPT-based systems) can handle simple, clearly defined tasks fairly well. You give the model a single command, it executes, and more often than not it gets the job done. In fact, Salesforce’s internal study, led by AI scientist Kung-Hsiang Huang, showed a 58% success rate on these one-step interactions. Not perfect, then, but decent enough in low-complexity environments.

Now stretch that task into something with multiple steps: a customer issue that needs clarification, additional data, and some back-and-forth. That’s where the wheels come off. Performance drops to just 35% on multi-step requests. Why? These systems rarely ask the right follow-up questions, and they don’t know when they’re missing critical information. They aren’t context-aware across extended dialogues. When a task is ambiguous or underspecified, as complex CRM tasks often are, the models freeze up or go the wrong way entirely.
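For a sense of what’s missing, here’s a minimal sketch of the clarification step these agents tend to skip: check whether the required details are present, and ask for them if not. The slot names and refund scenario are illustrative, not drawn from the Salesforce study:

```python
# Illustrative slots for a refund request; names are hypothetical,
# not taken from the Salesforce study.
REQUIRED_SLOTS = ["customer_id", "order_id", "refund_reason"]

def missing_slots(request: dict) -> list[str]:
    """Return the required fields the user has not yet provided."""
    return [slot for slot in REQUIRED_SLOTS if not request.get(slot)]

def handle_request(request: dict) -> str:
    gaps = missing_slots(request)
    if gaps:
        # The step current agents rarely take: notice the gap and ask
        # a targeted follow-up instead of guessing.
        return "Before I act on this, could you share: " + ", ".join(gaps) + "?"
    return f"Processing refund for order {request['order_id']}."

print(handle_request({"customer_id": "C-1042"}))
# Asks for order_id and refund_reason rather than acting on a guess.
```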

This presents a real limitation for enterprise use. Many customer service inquiries are not clearly defined at the outset. They need to be clarified through dialogue, something these AI agents aren’t particularly good at today. This isn’t just a technical problem. For businesses, it means handing over customer experience to these tools could harm your brand if queries aren’t handled with a basic understanding of nuance.

That’s a problem that will get solved over time. The models will get better at dynamic interaction through updates and better training. But right now, handing complex, multi-turn CRM communication to an unsupervised AI is not wise.

Executives evaluating these technologies should understand the current limits clearly. These tools are effective in linear, low-context environments. But for now, high-context, evolving scenarios still need human judgment at the center. Until AI agents can match humans in sense-making, real CRM at scale will remain a hybrid game.

LLM agents exhibit strong performance in executing well-defined, single-turn workflows

When the task is simple, clearly defined, and doesn’t require more than one step, large language models work well. In the same Salesforce study led by AI scientist Kung-Hsiang Huang, the best-performing AI agents achieved an 83% success rate on single-turn workflow tasks. That’s high enough to be genuinely useful in a production environment, especially for predictable, routine actions.

In these scenarios, the model doesn’t need to do much interpretation. It receives a clear instruction and delivers a response that meets expectations. There’s no confusion, no need to ask users for clarification, and no deviation from the original goal. That’s where current LLM agents are at their best: high precision under tightly scoped commands.

This performance level has clear value. For business leaders, it means immediate ROI when placing these models into roles where task boundaries are clear and outputs are consistent. Think triggering reports, updating a CRM field, or scheduling follow-ups: tasks where speed and repeatability count more than adaptive reasoning.
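To make that concrete, here’s a rough sketch of what a tightly scoped deployment can look like: the model’s output is treated as untrusted, parsed as structured data, and checked against a whitelist of permitted actions before anything touches the CRM. The action names and JSON shape here are hypothetical, not from the study:

```python
import json

# Hypothetical whitelist of actions the agent may perform. Everything
# outside this set is rejected, keeping the task single-turn and scoped.
ALLOWED_ACTIONS = {"update_field", "trigger_report", "schedule_followup"}

def execute_crm_action(llm_output: str) -> str:
    """Parse and validate a model's structured output before acting on it."""
    try:
        action = json.loads(llm_output)
    except json.JSONDecodeError:
        return "rejected: output was not valid JSON"

    if action.get("name") not in ALLOWED_ACTIONS:
        return f"rejected: unknown action {action.get('name')!r}"

    # A real deployment would dispatch to the CRM API here;
    # this sketch just confirms the action passed validation.
    return f"ok: {action['name']} with args {action.get('args', {})}"

# A well-formed single-turn response from the model...
print(execute_crm_action('{"name": "update_field", "args": {"record": "A-7", "stage": "Closed Won"}}'))
# ...and an out-of-scope one, caught before it touches any data.
print(execute_crm_action('{"name": "delete_all_records"}'))
```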

But effectiveness here depends entirely on reducing ambiguity up front. If inputs deviate from the expected structure, even slightly, the model’s accuracy drops. LLMs don’t truly understand context; they match patterns. So your success rate is tied closely to how predictable the task input is and how well workflows are designed around the AI’s current capabilities.

For executives, the play here is identifying those segments of your customer workflow that are clean and repeatable. That’s where automation with LLMs can be confidently deployed today. Everything else, particularly tasks requiring judgment, inference, or user-driven change, should still have human oversight. Automate the narrow, not the nuanced.

LLMs lack an inherent sense of confidentiality, presenting substantial privacy and security risks

Let’s be direct: current large language models don’t understand confidentiality. They can handle data, but they don’t know what should be kept private unless explicitly told. That’s a critical failure point, especially in business contexts where customer data, financials, or proprietary information must be protected by default, not by exception.

The Salesforce study, led by AI scientist Kung-Hsiang Huang, highlighted this: LLM agents perform poorly at managing sensitive information. You can instruct them, through specific prompts, not to share or act on confidential data. That works in short bursts. But over longer conversations those instructions lose strength, and the model tends to forget what it was told. In other words, your privacy safeguard fades the more the agent talks, and CRM almost always involves ongoing dialogue.
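One common mitigation is to re-assert the rule on every turn instead of stating it once up front. Below is a minimal sketch, assuming a generic chat-style message format; the rule text and function are illustrative, not from the Salesforce study:

```python
CONFIDENTIALITY_RULE = (
    "Never reveal customer PII, account numbers, or internal pricing. "
    "If asked for them, decline and offer to escalate to a human."
)

def build_messages(history: list[dict], user_turn: str) -> list[dict]:
    """Rebuild the message list each turn, re-asserting the safety rule
    near the newest message instead of relying on a single system prompt
    issued at the start of a long conversation."""
    return (
        [{"role": "system", "content": CONFIDENTIALITY_RULE}]
        + history
        + [
            # Repeat the rule close to the latest turn, where it is
            # least likely to be diluted by a long context.
            {"role": "system", "content": "Reminder: " + CONFIDENTIALITY_RULE},
            {"role": "user", "content": user_turn},
        ]
    )
```

Even this only slows the fade. Per the study, prompt-level safeguards weaken over long dialogues regardless, which is why deterministic controls outside the model matter more.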

The risk compounds when you work with open-source models. They often struggle even more with layered or complex instructions, which makes them worse at maintaining confidentiality in nuanced scenarios. These tools don’t have a built-in framework for identifying customer PII or internal business data, and that’s a serious issue.

For executives, this isn’t just a technology bug. It’s a liability. Without structural safety protocols built in, relying on LLMs in sensitive workflows opens real threats: data leaks, regulatory breaches, and brand damage. Most organizations can’t afford that kind of exposure if something goes wrong at scale.

You need controls, strict ones. If you’re thinking about integrating AI into environments with sensitive customer or organizational data, do it with safeguards in place. And if those safeguards aren’t proven under pressure, wait. This isn’t about being cautious; it’s about being rational. Without native confidentiality awareness, today’s AI is not trustworthy in data-sensitive environments.
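What do strict controls look like? One structural safeguard is deterministic redaction that runs outside the model, so it can’t fade over a long conversation the way a prompt does. Here’s a minimal sketch; the regex patterns are illustrative and nowhere near a complete PII taxonomy:

```python
import re

# Illustrative patterns only; a production redactor would need a far
# broader PII taxonomy (names, addresses, locale-specific formats, etc.).
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact(text: str) -> str:
    """Strip recognizable PII before text ever reaches the model.
    Because this runs outside the LLM, it cannot 'forget' its rules
    the way a prompt-based safeguard does over a long conversation."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text

print(redact("Customer jane.doe@example.com, card 4111 1111 1111 1111, is disputing a charge."))
# -> Customer [EMAIL REDACTED], card [CARD REDACTED], is disputing a charge.
```

Because the filter is ordinary code, its behavior can be tested and audited, which is exactly what prompt-based safeguards lack.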

LLMs remain unsuitable for high-stakes, data-heavy CRM applications

The reality is simple. Large language models aren’t ready for mission-critical CRM roles, not yet. They still lack core abilities like stable reasoning over long conversations, maintaining instruction consistency, and reliably distinguishing between sensitive and public data. That’s a problem if you’re thinking about deploying these systems in customer touchpoints where both nuance and data protection matter.

The Salesforce research, under Kung-Hsiang Huang, made this clear. When you try to patch these gaps with prompt engineering, adding safeguards manually through instructions, performance degrades. Not only do the models become less effective at completing tasks, but those safety prompts also lose potency over longer interactions. It’s not scalable, and definitely not dependable when the stakes are high.

This makes a strong case for leaders to resist early over-adoption. You can use LLMs now, but they need to be placed in well-controlled, well-defined environments. High-stakes scenarios involving customer data, legal risk, or brand impact demand better safety architecture than what these models currently support.

The capabilities will improve. Rapid iteration in AI development is real, and the reasoning gaps we’re seeing right now will close over time with better tooling, longer context windows, and updated model architectures. But as of today, placing these systems into roles that handle sensitive workflows without deep supervision is premature.

The smart move for executives is strategic adoption: use LLMs where they already provide value, and hold off where the risk equation doesn’t support the technology’s current limits. Let performance and safety standards define the rollout schedule.

Main highlights

  • AI struggles with multi-step CRM tasks: LLM agents succeed only 35% of the time on complex, multi-step CRM tasks due to limited reasoning and poor clarification abilities. Leaders should avoid assigning AI unsupervised tasks that depend on dynamic dialogue or incomplete user input.
  • AI succeeds in simple, structured workflows: With an 83% success rate in single-turn tasks, LLMs are effective in predictable workflows. Executives should focus AI deployment on clearly defined, repetitive CRM functions to drive short-term efficiency.
  • Confidentiality is a weak spot in current LLMs: Most AI agents lack inherent awareness of what counts as confidential, posing data privacy and compliance risks. Businesses handling sensitive information should delay AI integration until stronger, tested safeguards are in place.
  • Current models aren’t enterprise-ready for sensitive tasks: Prompt-based safety solutions degrade over time and hurt accuracy, making LLMs unreliable for high-stakes CRM. Decision-makers should adopt a selective rollout strategy, using AI only in low-risk environments until reasoning and security improve.

Alexander Procter

July 31, 2025

7 Min