Agents should be embedded within workflows

Too many so-called AI “agents” today exist in a vacuum, disconnected from real workflows. That’s not how you get meaningful results. David Loker, VP of AI at CodeRabbit, put it simply: their agent doesn’t roam freely; it works inside a defined workflow. Every step is structured. Deterministic stages like fetching diffs, analyzing code graphs, and scanning static data handle predictable parts of the job. The system activates agentic reasoning only when human-like judgment is needed. This is what separates a flashy demo from a production system that actually delivers.
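
To make the division concrete, here is a minimal Python sketch of a workflow-embedded review pipeline: deterministic stages as ordinary functions, with a single agentic step where reasoning is invoked. Every name is hypothetical and the stages are stubbed; this illustrates the pattern, not CodeRabbit's actual code.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewContext:
    diff: str
    code_graph: dict = field(default_factory=dict)
    static_findings: list = field(default_factory=list)
    comments: list = field(default_factory=list)

def fetch_diff(pr_id: str) -> str:
    # Deterministic: pure data retrieval, no model call (stubbed).
    return f"diff for {pr_id}"

def build_code_graph(diff: str) -> dict:
    # Deterministic: parse imports and call sites into a graph (stubbed).
    return {"nodes": [], "edges": []}

def run_static_analysis(diff: str) -> list:
    # Deterministic: linters and scanners yield factual findings (stubbed).
    return ["possible nil dereference at parser.py:42"]

def agentic_review(ctx: ReviewContext) -> list:
    # Agentic: the only step that would call a reasoning model.
    return [f"review comment informed by {len(ctx.static_findings)} findings"]

def review_pipeline(pr_id: str) -> ReviewContext:
    ctx = ReviewContext(diff=fetch_diff(pr_id))
    ctx.code_graph = build_code_graph(ctx.diff)          # deterministic
    ctx.static_findings = run_static_analysis(ctx.diff)  # deterministic
    ctx.comments = agentic_review(ctx)                   # reasoning only here
    return ctx

print(review_pipeline("PR-123").comments)
```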

Think about what that means for your organization. A controlled, workflow-embedded agent doesn't guess; it executes. When you map your process first and identify where reasoning adds value, you create something stable and measurable. Agents aren't replacing humans; they're strengthening structured processes at the points where decisions get made. For enterprise leaders, this approach reduces error rates, prevents unpredictable model behavior, and ensures reliability.

The Agentic Design Patterns study identified five essential subsystems every effective agent needs: perception and grounding, reasoning and world modeling, action execution, learning and adaptation, and inter-agent communication. Systems combining structured workflows with embedded agentic loops achieved an 88.8% average Goal Completion Rate across domains (Towards Outcome-Oriented, Task-Agnostic Evaluation of AI Agents). Those that relied solely on chain-of-thought or tool-based reasoning lagged behind significantly.

David Loker’s team at CodeRabbit has already shown this in production. Their AI code review system handles thousands of real deployments by inserting intelligence only where reasoning is required, and nowhere else. It’s a scalable, accountable model worth learning from.

Rigorous context engineering matters more than traditional prompt engineering

The industry keeps talking about "prompt engineering." It's an overhyped focus on crafting clever input instructions. The reality is that context engineering is what truly moves the needle. Loker describes this as "assembling the right information from the right sources, in the right format, at the right step." At CodeRabbit, context for each review session isn't random; it comes from diffs, static analysis, related files, import graphs, earlier user feedback, and documentation.
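
As a rough sketch, per-step context assembly can be as simple as a plan that maps each workflow step to the sources it is allowed to see. The step names, sources, and plan below are illustrative assumptions.

```python
def assemble_context(step: str, sources: dict[str, str],
                     plan: dict[str, list[str]]) -> str:
    # The plan whitelists sources per step; nothing else enters the prompt.
    return "\n\n".join(f"## {name}\n{sources[name]}" for name in plan[step])

sources = {
    "diff": "...",
    "static_analysis": "...",
    "import_graph": "...",
    "prior_feedback": "...",
    "docs": "...",
}
plan = {
    "security_pass": ["diff", "static_analysis", "import_graph"],
    "style_pass": ["diff", "prior_feedback", "docs"],
}
print(assemble_context("security_pass", sources, plan))
```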

Context engineering is the hidden foundation behind an agent’s reasoning quality. Properly structured context ensures the model understands what matters most. It knows the scope, constraints, and expectations before attempting to reason. In practical business terms, this reduces wasted compute cycles, prevents off-topic reasoning, and makes every call to the model more purposeful.

This precision also enables agents to scale across workflows without losing accuracy. An academic study on Agentic Context Engineering (ACE) showed that structured, incremental context updates improved accuracy by 10.6% compared to static prompts rewritten wholesale at each step. When context collapsed from 18,000 tokens down to just over 100, the model's accuracy dropped from 66.7% to 57.1%. Context must evolve, not reset. Treat it as a living source of truth that grows with your process.
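
In code, the distinction is between rewriting the whole context and merging structured deltas into it. A minimal sketch, assuming a simple keyed store (the ACE framework's actual structures are richer):

```python
context: dict[str, str] = {}  # entry_id -> accumulated lesson

def apply_delta(context: dict[str, str], delta: dict[str, str]) -> None:
    # Merge new or revised entries; never discard and rewrite wholesale,
    # which is the collapse mode the ACE study measured.
    context.update(delta)

apply_delta(context, {"io-01": "Validate file paths before writing."})
apply_delta(context, {"io-02": "Batch API calls inside hot loops."})
print(len(context), "entries retained")
```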

For C-suite leaders, the message is straightforward. Strong context engineering ensures your AI investments generate real outcomes rather than noise. Instead of pouring money into one massive model or prompt tricks, design a data pipeline that continuously feeds the model exactly what it needs, no more and no less. That’s how you make an AI system actually perform when it counts.


Excess or irrelevant context can degrade AI performance

When designing intelligent agents, more data does not always mean better performance. David Loker points out that overloading a model with context, especially irrelevant or tangential information, can weaken reasoning and accuracy. This happens even when the data itself is technically correct. The model becomes distracted, losing focus on what matters.

Research backs this up. Studies from TII and Sapienza formalized what’s known as the Distracting Effect. They found that strong retrievers, systems designed to surface highly relevant data, can actually introduce more dangerous distractors. These distractors appear contextually correct but lead the model toward confident, wrong answers. The drop in performance can range from six to eleven points, depending on the type of irrelevant content. “Modal statements,” the kind that sound authoritative but hedge the truth, are particularly damaging.

For executives leading AI-driven companies, this is not an academic issue. It's a production risk. When your model's retrieval pipeline drags in extraneous material, you don't just see output variation; you see operational inefficiency and eroded trust. The safer path is curation. Loker emphasizes that context selection is an essential step of context engineering: filter aggressively, and only include information that supports the model's specific reasoning task.
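
A hedged sketch of that curation step, with a trivial keyword-overlap scorer standing in for a real relevance model:

```python
def relevance(task: str, chunk: str) -> float:
    # Toy scorer: fraction of task words that appear in the chunk.
    task_words = set(task.lower().split())
    return len(task_words & set(chunk.lower().split())) / max(len(task_words), 1)

def curate(task: str, chunks: list[str], threshold: float = 0.3) -> list[str]:
    # Filter aggressively: anything below threshold never reaches the model.
    return [c for c in chunks if relevance(task, c) >= threshold]

chunks = [
    "null check missing in parser module",
    "company holiday calendar for 2025",
    "parser module raises on malformed input",
]
print(curate("review parser module null handling", chunks))
```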

This principle impacts cost and performance metrics directly. By reducing unnecessary context, you save compute resources while improving accuracy. You also gain interpretability: your team can trace exactly why an output occurred. Well-engineered filters are a cornerstone of AI systems that deliver predictable, high-quality outcomes at scale.

Overloading agents with procedural skills or comprehensive documentation can be counterproductive

Procedural knowledge can make or break an agent's performance. The SkillsBench benchmark shows that human-curated procedural guidance (what the study calls "Agent Skills") raises success rates significantly. But more is not always better. Systems loaded with too many skills or unfiltered documentation tend to degrade in accuracy and consistency.

The data is clear. Focused, human-written skills raised task completion by +16.2 percentage points on average. The optimal balance was two to three focused skills, with performance gains peaking at +18.6pp. Beyond that, results weakened rapidly: comprehensive documentation actually caused a -2.9pp performance loss. Self-generated skills, where the model created its own procedural knowledge, achieved no improvement and occasionally reduced accuracy (GPT‑5.2 dropped by -5.6pp). The dependency on human curation remains non‑negotiable.
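
One way to enforce that cap in practice, sketched with a naive relevance proxy and hypothetical skill modules:

```python
def overlap(task: str, text: str) -> int:
    return len(set(task.lower().split()) & set(text.lower().split()))

def select_skills(task: str, skills: dict[str, str], k: int = 3) -> list[str]:
    # Rank skills against the task and keep at most k (2-3 per the benchmark).
    return sorted(skills, key=lambda name: -overlap(task, skills[name]))[:k]

skills = {
    "write-migrations": "how to write safe sql migrations step by step",
    "review-auth": "review auth and session handling checklist",
    "style-guide": "formatting conventions for python modules",
    "deploy-runbook": "procedure for deploying to staging and production",
}
print(select_skills("review auth changes in login module", skills, k=2))
```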

Enterprises should interpret this with care. It means that procedural systems, when carefully pruned and purpose-built, can outperform larger or more “advanced” models lacking precision. Proper skill selection is an efficiency multiplier, not a scaling shortcut. A smaller, well‑instructed model can outperform a larger one that drowns in conflicting or redundant guidance.

From a leadership perspective, investing in skilled human curation brings higher ROI than expanding compute or storage capacity. Treat procedural knowledge as a controlled resource. Use it where it delivers maximum leverage, and limit over-instruction that adds noise. Precision beats completeness every time in high‑performance agent design.

Different workflow stages require specialized models

A single model cannot handle every task efficiently. CodeRabbit’s production setup proves it. The company uses more than ten model variants, each selected for a specific part of the workflow. The reason is simple: some tasks demand reasoning depth, while others require speed and low latency. A large reasoning model is wasted on small procedural checks, and a small one can’t manage complex contextual analysis.

David Loker, VP of AI at CodeRabbit, explains that model choice isn’t arbitrary. It’s based on latency tolerance, reasoning requirements, and cost. If a given task involves high‑frequency loops, a smaller, faster model prevents bottlenecks. Where judgment or deeper analysis is needed, a stronger reasoning model steps in. The overall workflow runs on a “best‑fit” principle, each model doing only what it’s optimized for.
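
A minimal routing sketch built on those three criteria; the model names and thresholds are illustrative assumptions, not CodeRabbit's actual lineup:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    name: str
    needs_deep_reasoning: bool
    latency_budget_ms: int

def route(task: Task) -> str:
    # Best-fit principle: smallest model that satisfies the task's demands.
    if task.needs_deep_reasoning and task.latency_budget_ms >= 30_000:
        return "large-reasoning-model"   # slow, costly, thorough
    if task.needs_deep_reasoning:
        return "mid-reasoning-model"     # compromise under tight latency
    if task.latency_budget_ms < 2_000:
        return "small-fast-model"        # high-frequency loop work
    return "general-purpose-model"

print(route(Task("summarize diff", False, 1_000)))        # small-fast-model
print(route(Task("architectural review", True, 60_000)))  # large-reasoning-model
```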

This layered approach affects cost/performance balance directly. Because CodeRabbit absorbs token costs instead of passing them to customers, efficiency at every stage matters. Selecting the smallest necessary model for a step keeps expenses predictable without lowering quality. It’s a disciplined way to scale AI services while maintaining high accuracy and consistent delivery.

For executives, this principle reveals how to deploy AI at operational scale without letting cost spiral. Mixing models strategically means you don’t overpay for unnecessary computation. It also ensures system resilience; if one model underperforms, another specialized instance can handle its segment cleanly. In practice, this structure results in better output consistency, lower latency, and reduced compute overhead, all vital to delivering enterprise‑grade reliability.

Tool pipelines must be engineered deliberately with clear stages

Agents work best when the tools they use are deliberately structured. CodeRabbit doesn't grant its agents access to a random set of utilities. Each tool is inserted at a precise stage, with a specific purpose tied to workflow logic. The process begins with deterministic operations (building code graphs, collecting diffs, and running static analysis) to assemble a foundation of factual data. Then the agentic layer, guided by the reasoning models, interprets this structured information to deliver insight.

This careful orchestration prevents chaos during task execution. The reason is straightforward: tool selection and invocation are critical failure points. Most agent errors occur not when a tool runs, but when the system picks the wrong one or fails to process its output correctly. Proper engineering includes four steps: tool discovery, selection, invocation, and result integration. Each requires defined parameters, structured error handling, and validation before results flow back into the reasoning process.
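
A compact sketch of those four phases, with a hypothetical tool registry, stage-pinned selection, and validation before anything returns to the reasoning step:

```python
from typing import Any, Callable

# Discovery: the registry is the complete, known set of tools.
TOOLS: dict[str, Callable[[str], Any]] = {
    "static_analysis": lambda target: ["unused import at line 3"],
    "fetch_docs": lambda target: f"docs for {target}",
}

def select_tool(stage: str) -> str:
    # Selection: pinned per workflow stage, not left to free-form choice.
    return {"lint": "static_analysis", "lookup": "fetch_docs"}[stage]

def invoke(tool_name: str, target: str) -> dict:
    # Invocation with structured error handling.
    try:
        result = TOOLS[tool_name](target)
    except Exception as exc:
        return {"ok": False, "error": str(exc)}
    # Integration: validate before results flow back into reasoning.
    if not result:
        return {"ok": False, "error": "empty result"}
    return {"ok": True, "data": result}

print(invoke(select_tool("lint"), "parser.py"))
```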

In CodeRabbit’s system, static analysis tools aren’t meant to deliver final answers. They provide awareness, helping the AI identify where potential issues might exist, not declaring them outright. This allows the reasoning model to weigh all available evidence before making a decision. Web queries fill real‑time knowledge gaps, such as documentation changes that post‑date the model’s training data.

For business leaders, this kind of deliberate tool engineering delivers operational safety and transparency. It ensures that automation supports human objectives rather than acting unpredictably. Purpose‑built pipelines reduce error rates and false positives while maximizing clarity in how results are produced. When every tool's function is clearly defined and integrated, the organization gains technical control, operational confidence, and measurable consistency: the essentials for building trustworthy enterprise AI systems.

Memory systems require active curation and structured retrieval

Long-term memory makes an agent smarter only if the information it keeps is structured and filtered. CodeRabbit doesn't let its system accumulate random logs. Instead, it curates memory to record actionable information: developer feedback, code review history, and company-specific rules. This memory isn't static. It evolves with usage, helping the agent contextualize new situations accurately without retraining the base models.

David Loker, VP of AI at CodeRabbit, describes how developers' feedback flows directly into memory updates. When users reject certain comments or express preferences for workflow standards, that data becomes part of the contextual store. Next time a similar scenario appears, the agent recalls this feedback to adjust its behavior. This approach delivers mass customization across organizations without fine-tuning a new model for each client.
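
A sketch of a curated memory store along these lines, with an illustrative schema and a simple curation gate that drops feedback carrying no actionable rule:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryEntry:
    topic: str    # e.g. "error-handling", "naming"
    rule: str     # the actionable preference
    source: str   # who or what established it

memory: list[MemoryEntry] = []

def record_feedback(topic: str, rule: str, source: str) -> None:
    # Curation gate: skip vague feedback with no actionable content.
    if len(rule.split()) < 3:
        return
    memory.append(MemoryEntry(topic, rule, source))

def recall(topic: str) -> list[str]:
    # Structured retrieval by topic, not a scan over raw logs.
    return [m.rule for m in memory if m.topic == topic]

record_feedback("error-handling", "prefer explicit error returns over panics", "dev feedback")
record_feedback("naming", "ok", "dev feedback")  # rejected: not actionable
print(recall("error-handling"))
```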

Research supports this hands-on approach to memory management. The MemInsight paper reported that semantically curated memory structures improved recall performance by +34% compared to naive retrieval-augmented generation (RAG) setups. The MAIN-RAG study adds that filtering retrieved data through multiple agents before feeding it into the reasoning model significantly lifts accuracy and reduces cognitive noise. The conclusion is straightforward: effective memory depends as much on intelligent curation as on retrieval capability.

For executives, the takeaway is that structured memory saves both time and cost. It allows one AI core to serve multiple organizations with distinct standards, improving adaptability while minimizing maintenance. This strategy turns organizational nuance into a competitive advantage: your AI system becomes smarter because it learns selectively and improves continuously, not because it stores everything.

Independent and multilayered verification enhances output quality

Quality in AI output doesn't come from more computation; it comes from verification that's independent and layered. CodeRabbit achieves this through what David Loker calls "post-review verification." One model generates the review, and another model checks it. The verification system runs independently, evaluating accuracy, context alignment, and factual grounding. If the first model's output contains errors, the second catches and filters them.

This architecture ensures accountability across all reasoning steps. Loker notes that using separate models works better than relying on self-validation. Each model has distinct training distributions and tendencies, which makes their verification interplay more effective. The goal is clear: detect hallucinations, ensure files and code references are real, and discard speculative or contradictory outputs through what Loker refers to as “de-noising.”
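
A minimal generate-then-verify sketch in that spirit; both model calls are stubbed, and the grounding check only verifies that referenced files exist:

```python
def generator_model(context: str) -> list[str]:
    # Stub for the review-writing model; one comment references a ghost file.
    return ["comment about src/parser.py", "comment about src/ghost.py"]

def verifier_model(comment: str, known_files: set[str]) -> bool:
    # Independent check: keep only comments grounded in real files.
    return any(path in comment for path in known_files)

def reviewed(context: str, known_files: set[str]) -> list[str]:
    draft = generator_model(context)
    return [c for c in draft if verifier_model(c, known_files)]

# The ghost-file comment is de-noised away before anything reaches the user.
print(reviewed("...", {"src/parser.py", "src/main.py"}))
```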

Peer-reviewed frameworks support this structure. In the Agentic Context Engineering (ACE) framework, this separation between the Generator (creator model) and the Reflector (verifier) is fundamental. Ablation studies show that removing the Reflector leads to significant performance degradation. The model feedback loop, supported by human and environmental signals, helps align results with real conditions.

For C-suite leaders, this is a signal to treat verification as design, not as audit. Independent verification layers make AI systems more trustworthy and easier to maintain within quality compliance frameworks. This structure guarantees that no single model becomes a critical failure point. It’s a disciplined approach to scaling AI outputs while keeping accuracy, cost, and risk under control.

Continuous evaluation is essential for sustaining AI performance

In modern AI operations, evaluation is never finished. David Loker, VP of AI at CodeRabbit, describes how the company treats performance monitoring as an ongoing function rather than a final step. Models change, APIs update, and market demands shift. To maintain reliability, every update must be verified, measured, and fine-tuned through structured evaluation layers.

CodeRabbit’s system checks both quantitative and qualitative metrics. Quantitatively, the team measures recall, precision, and signal-to-noise ratios. Fewer but higher-quality results indicate real improvement. Qualitatively, the team reviews tone consistency, comment clarity, and context accuracy. They monitor model rollout through staged deployments, observing whether teams adopt or reject new behaviors. These checks ensure that technical metrics align with user experience, not only system benchmarks.
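
The quantitative side reduces to a few formulas. A sketch with illustrative counts, treating the signal-to-noise ratio as useful comments over noise comments:

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# tp: useful comments kept, fp: noise emitted, fn: real issues missed
p, r = precision_recall(tp=42, fp=6, fn=10)
snr = 42 / 6  # "fewer but higher-quality results" raises this ratio
print(f"precision={p:.2f} recall={r:.2f} snr={snr:.1f}")
```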

Business leaders should recognize the strategic value in this continuous loop. Models evolve every few months, but leadership frameworks often move slower. Relying on a static evaluation approach means falling behind technological updates. Loker compares model changes to mandatory library upgrades: unavoidable, frequent, and often disruptive. Systems that evaluate and adapt regularly stay consistent through these shifts and offer more predictable performance to end users.

Research confirms the value of sustained evaluation. The Outcome-Oriented Evaluation of AI Agents framework measures across eleven operational dimensions, covering both efficiency and business value metrics such as Goal Completion Rate, ROI, and Multi-Step Task Resilience. The findings are clear: no architecture performs best across every metric. Continuous evaluation identifies the optimal fit per domain and maintains quality against drift.

Andrej Karpathy, Greg Brockman, and Mike Krieger have repeatedly reinforced this point: ongoing evaluation often provides all the signal you need to maintain reliability. For executives, it’s a disciplined form of quality assurance, a guarantee that each system update contributes toward measurable progress instead of unpredictable change.

Multi-agent coordination topology significantly impacts performance

Coordination among agents determines how effectively complex systems perform. At CodeRabbit, multiple agents interact to perform different parts of a workflow, with defined roles and responsibilities. The success of such systems depends heavily on how communication between agents is structured.

David Loker outlines two "agentic loops" in CodeRabbit's production system: one before the primary review, using a heavy reasoning model, and a second afterward for post-analysis. This controlled interplay improves feedback quality and decision reliability. Research expands on this by showing that communication topology within multi-agent systems is a measurable factor of success. Graph-based structures, where agents can exchange information freely, outperform restrictive topologies like star or chain configurations where communication is centralized or linear.

The MultiAgentBench study quantifies this impact. Adding a planning phase, where agents first decide how they will collaborate, boosted milestone achievement by roughly +3%. Graph configurations achieved higher success rates on complex reasoning and multi-step problem solving due to real-time information exchange. Systems using static or hierarchical communication displayed weaker adaptability.
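
The difference is easy to see in miniature. The toy sketch below counts, via breadth-first traversal, how many communication rounds a finding needs to reach every agent under a chain versus a fully connected graph:

```python
def rounds_to_reach(topology: dict[str, list[str]], start: str) -> dict[str, int]:
    # BFS: rounds until each agent has seen a finding originating at `start`.
    seen, frontier = {start: 0}, [start]
    while frontier:
        nxt = []
        for node in frontier:
            for peer in topology[node]:
                if peer not in seen:
                    seen[peer] = seen[node] + 1
                    nxt.append(peer)
        frontier = nxt
    return seen

chain = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}
graph = {n: [m for m in "ABCD" if m != n] for n in "ABCD"}
print(rounds_to_reach(chain, "A"))  # {'A': 0, 'B': 1, 'C': 2, 'D': 3}
print(rounds_to_reach(graph, "A"))  # every peer reached in one round
```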

For C-suite executives, this insight points to a design principle: inter-agent communication must be intentional, not incidental. In large-scale AI deployments, agents handling different modules (analysis, reasoning, validation) perform more efficiently when allowed structured but flexible communication. Making collaboration explicit improves efficiency, consistency, and interpretability across the entire workflow.

Organizations that design their agent topologies deliberately will achieve higher system resilience and clearer performance outcomes. The design of coordination is not an operational detail; it's a strategic layer of control that directly influences the productivity of intelligent systems at scale.

Following a structured checklist enhances agent production success

A reliable AI system is not the outcome of experimentation; it is the result of a structured process repeated with discipline. CodeRabbit's experience in deploying production-grade agents shows that predictable success depends on following a comprehensive checklist grounded in research and operational consistency.

The checklist begins with establishing what can be evaluated, both in terms of business outcomes and technical performance. Evaluation must connect to measurable goals: recall, precision, workflow completion rate, and user satisfaction. Next is workflow mapping. Each step should be categorized as deterministic or agentic. Deterministic steps run on fixed logic, while agentic steps require reasoning. This ensures that intelligence is only applied where human-like judgment genuinely matters.

Context engineering comes next. It involves selecting data with precision, balancing quantity and relevance, and structuring it for clarity. The system must include only the most relevant data, since excessive or tangential context degrades model performance. Curating procedural knowledge follows the same logic: retain two to three focused, human-verified skill modules and reject over-documented or self-generated ones that introduce redundancy.

Model selection, tool integration, and memory management are sequential steps built on deliberate engineering. Different tasks require different models, chosen for their reasoning strength, latency performance, and cost efficiency. Tools must be implemented with clear discovery, selection, invocation, and integration phases. Memory systems must be curated so they evolve intelligently, retaining only structured information that supports precision in future reasoning.

Verification acts as a safeguard, ensuring outputs are reviewed by separate models and verified across multiple feedback channels: human, environmental, and system-based. Finally, for systems using multiple agents, collaboration topology must be planned early. Graph-based coordination offers flexibility and higher reliability in complex reasoning chains.

Executives should view this structured approach not as technical overhead but as a stability framework. Each component of the checklist builds measurable control into the system. It reduces dependency on any single model, ensures consistency across changing architectures, and keeps performance aligned with business metrics.

The wider implication is strategic. AI agents mature through iterative refinement, not abrupt redesign. Following this checklist allows organizations to engineer agents that maintain performance continuity as technologies evolve. As models, tools, and data infrastructures advance, this structure ensures your AI systems remain resilient, scalable, and cost-efficient, producing reliable outcomes long after deployment.

The bottom line

AI agents that truly work aren’t the result of size, hype, or clever prompts. They come from structure, precision, and discipline. The companies leading this shift are treating AI not as an experiment but as infrastructure, something engineered, tested, and continuously refined.

For business leaders, the lesson is simple. Success with AI depends on building systems that are modular, measurable, and resilient to change. Curated context, independent verification, and ongoing evaluation are not technical details; they are business controls that protect performance and cost efficiency.

Decisions about AI architecture now shape return on investment tomorrow. Embedding intelligence into workflows, rather than chasing autonomous constructs, creates consistency. The organizations that adopt this structured mindset will transform automation into a lasting advantage.

In the end, practical AI isn’t about chasing the latest model. It’s about creating a framework that keeps improving, adapting, and delivering value long after deployment. The most durable systems are built on continuous learning and responsible engineering, principles that define real intelligence at scale.

Alexander Procter

April 24, 2026

