Computer-use models lack production-grade reliability

Browser automation’s next big step isn’t about mimicking humans; it’s about achieving production-level precision. We’ve seen some solid concepts surface, like OpenAI’s Operator in early 2025, which showed an AI using a browser just like a person: mouse movements, clicks, inputs, fully simulated interaction with a website. That captured a lot of attention.

But when you scale that across millions of sessions, things start to break. That’s exactly what happened. OpenAI quietly dropped Operator just eight months after launch. Turns out, the system couldn’t handle real-world inconsistencies, like rendering lags, layout shifts, or pages that load one way today and differently tomorrow. Vision-based models, which rely on screenshots and image recognition to act, can miss critical signals. And when you’re running thousands of automated browser sessions across an enterprise, even a 1% failure rate isn’t just inconvenient; it’s expensive.

Right now, computer-use models can’t meet enterprise-grade reliability thresholds. They’re fragile under variance and too slow for high-throughput tasks. Until they speed up and stabilize, they remain in the demo stage, not deployment. For C-suite leaders focused on scaling automation with confidence, this is the signal: don’t bet on vision-only agents yet. The technology is impressive, but production demands resilience, not novelty.

OpenAI’s pivot to a hybrid model in ChatGPT Agent Mode is a pragmatic move. It acknowledges that raw mimicry isn’t enough. Reliability and control matter more, especially when automating mission-critical workflows.

DOM-based agents offer the precision that production demands

DOM-based approaches are more controlled and faster. They don’t guess where to click. Instead, they read the page’s structured layer, the DOM (Document Object Model). Think of it as inspecting the underlying blueprint of a webpage to decide how to act. That process removes a lot of guesswork, and the best part: it’s highly repeatable.

These agents don’t just read raw HTML, though. They use pre-processed snapshots that transform each section of a page into clean, labeled text. Microsoft set this up well with its Playwright MCP server, which became a standard for converting the chaotic DOM into something models can reason over. This accelerates execution and reduces errors. A section of a page becomes structured, something like: navigation; link “About”; link “Store”; form field “Search”; button “Search by voice.” The agent sees this snapshot and can then say: “click ref=e47.” No estimations. No rendering confusion. Just direct action.
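
To make the idea concrete, here is a minimal sketch in Python using Playwright’s role-based locators. The URL and element names are hypothetical; the point is the addressing model, where the agent acts on named, structured elements rather than on pixel coordinates.

```python
# Minimal sketch: acting on page structure rather than pixels.
# The URL and element names below are hypothetical, for illustration only.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/store")  # hypothetical page

    # Target elements by role and accessible name, the same structured
    # labels a DOM snapshot exposes, instead of guessing coordinates.
    page.get_by_role("link", name="About").click()
    page.get_by_role("textbox", name="Search").fill("running shoes")
    page.get_by_role("button", name="Search").click()

    browser.close()
```

Because every action is addressed by a stable, named element, the same script behaves the same way on every run, which is exactly what makes the approach repeatable.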

This approach scales. With DOM-based control, automation becomes fast, stable, and deterministic, three words executives should care about deeply when building high-reliability systems. While vision-based models are still figuring out what’s clickable and when, DOM-based agents already know.

As browser automation evolves, structure wins in precision-driven environments like finance, healthcare, and logistics, fields where one wrong click costs real money. Right now, if you need accuracy at scale, DOM agents are the answer.

Hybrid systems are the most reliable automation approach in 2025

In real-world operations, no single approach is good enough on its own. Vision-based agents offer flexibility, especially with highly visual or unstructured interfaces. DOM-based agents bring precision and speed when the page structure is clean and stable. But the future of browser automation, at least in 2025, is not choosing between them. It’s running both.

Hybrid browser agents are designed to handle variability. They default to the structured path, the DOM, when a page allows it. If the DOM is missing key elements or the interface is image-driven, they switch to a visual model to interpret the interface. This dual process gives systems the flexibility to deal with a wide range of interfaces without sacrificing reliability. That’s why OpenAI moved Operator’s capabilities into ChatGPT’s Agent Mode. Instead of relying solely on vision, ChatGPT Agent operates across visual and text-based browsers, choosing whichever one performs better based on the specific requirements of the task.
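
As a rough sketch of that dual path, the control loop below prefers the DOM and falls back to a vision model only when the structured route fails. The `vision_locate` callable is a hypothetical stand-in for a screenshot-based model, not any vendor’s actual API.

```python
# Hedged sketch of a hybrid action: try the structured DOM path first,
# fall back to a vision model only when that path fails.
from typing import Callable, Tuple

from playwright.sync_api import Page


def hybrid_click(
    page: Page,
    role: str,
    name: str,
    vision_locate: Callable[[bytes, str], Tuple[float, float]],  # hypothetical
) -> None:
    locator = page.get_by_role(role, name=name)
    if locator.count() > 0:
        # Structured path: deterministic, fast, repeatable.
        locator.first.click()
        return

    # Fallback path: the element isn't exposed in the DOM (canvas or
    # image-driven UI), so ask a vision model where to click on a screenshot.
    x, y = vision_locate(page.screenshot(), name)
    page.mouse.click(x, y)
```

The design choice is the ordering: the cheap, deterministic route is always tried first, and the expensive, more flexible route only absorbs the exceptions.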

From an executive vantage point, this hybrid approach dramatically reduces operational risk. You’re not hoping your automation choice fits, you’re deploying the method that works best in context. Failures drop. Scalability improves. Enterprises running thousands of sessions daily can’t afford brittle systems. Hybrid agents deliver consistency across highly dynamic environments.

Right now, this is the only practical solution that works in production at scale. It’s not an experimental idea; it’s the real standard for 2025.

Automation must learn, adapt, and improve over time

One-off task execution is not the goal of automation. The real value emerges when agents get faster and more accurate with each cycle. To get there, browser agents have to learn how to operate, adapt, and refine their behavior across time. We are beginning to see this shift.

Agents don’t just run one task and stop. The stronger systems first explore. They navigate new interfaces, attempt workflows, and log successful paths. That information gets converted into structured scripts: deterministic instructions written with tools like Playwright, Selenium, or the Chrome DevTools Protocol. These scripts are not static. New large language models can now iterate on them after each run, optimizing logic, cleaning up unnecessary steps, and handling edge cases the agent may have initially missed.
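
A hedged illustration of what that conversion can look like: after an exploratory run, the agent emits a deterministic Playwright script like the one below, which later model-driven passes can tighten by removing redundant steps or adding edge-case handling. The URL, workflow, and element names are hypothetical.

```python
# Hypothetical replay script emitted after an exploratory session.
# A later refinement pass by a code-generating model might remove redundant
# steps or add handling for edge cases the first run missed.
from playwright.sync_api import sync_playwright


def replay_export_report() -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com/dashboard")  # hypothetical URL
        page.get_by_role("link", name="Reports").click()
        page.get_by_role("button", name="Export CSV").click()
        page.wait_for_load_state("networkidle")
        browser.close()


if __name__ == "__main__":
    replay_export_report()
```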

This self-improving cycle is what moves automation from reactive to proactive territory. The exploration phase builds familiarity. The execution phase delivers performance. Over time, agents become faster and more consistent, not by human tuning, but through their own progressive refinement.

For enterprise leaders, this matters. Automation that adapts reduces support costs. It scales without hands-on supervision. It gets better over time by design, not by accident. If your automation strategy relies solely on fixed scripts or traditional workflows, it can’t compete with systems that self-optimize. The winners in automation will be the ones who commit not just to performance, but to learning.

Orchestrated systems will define the future of browser automation

The question shouldn’t be whether vision-based or DOM-based agents will win. The answer is already clear: both matter, but neither is enough on its own. The systems that lead going forward will be orchestrated, designed to combine vision, structure, and deterministic scripting, choosing the right tool for every interface and every context, step by step.

This is not hypothetical. ChatGPT Agent is already running both DOM and visual browser modes. It decides in real time what approach to use, based on performance, layout, and structural clarity. Visual grounding is improving. Systems like Claude 4 and opencua-72b-preview are advancing monthly. So yes, visual models are getting faster. But in 2025, full production reliability still requires structured orchestration, not modular substitution.

Enterprise environments demand this. If a form loads inconsistently or a dashboard doesn’t follow HTML standards, the agent must have a fallback method that ensures continuity. That fallback needs to be built into the system, automatically invoked when needed. Structured DOM control handles the predictable elements. Vision handles the exceptions. Deterministic scripts ensure reliable replays once workflows are learned. That three-part orchestration is what delivers end-to-end stability across workflows.
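
One way to picture that three-part orchestration, as a sketch rather than any product’s actual design: a per-step dispatcher that tries a learned deterministic script first, then structured DOM control, then vision, reporting which tier succeeded so later runs can start from the most reliable path. Every name here is hypothetical.

```python
# Sketch of a three-tier orchestration step: deterministic replay, then DOM,
# then vision. All names are hypothetical and for illustration only.
from typing import Callable, Optional


def run_step(
    step: str,
    replay: Optional[Callable[[], bool]],   # learned deterministic script, if any
    dom_action: Callable[[], bool],         # structured DOM control
    vision_action: Callable[[], bool],      # screenshot-based fallback
) -> str:
    if replay is not None and replay():
        return "replay"   # predictable elements: cheapest, most stable path
    if dom_action():
        return "dom"      # structure shifted slightly: re-resolve via the DOM
    if vision_action():
        return "vision"   # non-standard or image-driven UI: visual fallback
    raise RuntimeError(f"step failed on all tiers: {step}")
```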

For executives, the strategic insight is simple: don’t invest in isolated models or temporary fixes. Invest in orchestration frameworks that combine logic, vision, and code execution under one unified control layer. These systems finish workflows completely, recover when interfaces shift, and learn as they go. That’s how browser automation becomes operational, not just aspirational.

Key highlights

  • Computer-use models lack reliability at scale: Vision-only browser agents like OpenAI’s Operator remain too fragile for production due to rendering inconsistencies and UI variability. Leaders should avoid relying solely on these models until their error rates and system rigidity are significantly improved.
  • DOM-based agents offer precision and consistency: Agents that navigate using structured page data (DOM) deliver faster, repeatable, and more deterministic results. Executives automating workflows at scale should favor DOM-based systems to optimize for speed and accuracy.
  • Hybrid browser agents deliver the best current reliability: Combining DOM methods with vision when needed ensures higher reliability across diverse interfaces. Leaders deploying automation across mixed web environments should adopt hybrid agents as the default strategy.
  • Automation must learn and self-optimize: Agents that convert exploratory sessions into repeatable scripts and refine workflows using code-generating models provide long-term efficiency. C-suite teams should invest in automation systems capable of continuous learning and iteration.
  • Future-ready systems require orchestration across models: The most scalable agents in 2025 use a blend of vision, structure, and scripting intelligently coordinated with fallback paths. Decision-makers should prioritize orchestrated platforms over siloed tools to ensure automation withstands real-world variability.

Alexander Procter

December 19, 2025

7 Min