Technical Context
I regularly see the same request: “I want a local agent that reliably manages the desktop without the cloud.” Almost always, the first idea is to take a state-of-the-art multimodal LLM, feed it screenshots, and define clicks via raw (x, y) coordinates or an overlaid grid.
I have analyzed such prototypes, and every time I hit the same basic disconnect: vision solves “find the button and click” well, but starts to degrade on “scroll and read.” Scrolling turns into a loop: screenshot → text/element recognition → decision → scroll → new screenshot. Latency and cost grow linearly with page length, while quality drops non-linearly.
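The linear cost growth is easy to see in a back-of-the-envelope model. The sketch below is purely illustrative: the per-step and per-query timings are assumptions I chose for the example, not benchmarks, but the shape of the curves holds regardless of the exact numbers.

```python
# Illustrative cost model: a vision agent repeats screenshot -> recognition ->
# decision -> scroll once per viewport, while a structural read is a single
# accessibility/DOM query. All timings are assumed values for the sketch.

def vision_read_cost(page_height_px: int, viewport_px: int = 900,
                     per_step_sec: float = 2.5) -> float:
    """Cost of reading a page via the scroll loop: one full round-trip per viewport."""
    steps = -(-page_height_px // viewport_px)  # ceil division: number of scroll steps
    return steps * per_step_sec

def structural_read_cost(page_height_px: int, per_query_sec: float = 0.3) -> float:
    """Cost of one structural query: independent of page length."""
    return per_query_sec

print(vision_read_cost(9000))      # 10 viewports * 2.5 s = 25.0
print(structural_read_cost(9000))  # 0.3, regardless of length
```

Double the page and the vision cost doubles; the structural cost does not move. That asymmetry is the whole argument in two functions.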
The second problem is the fragility of coordinates. Any layout shift, scaling, different DPI, tooltips, animations, or just a “slightly different font” breaks the coordinate binding. The model is forced to constantly re-“watch” the screen because it lacks stable semantic anchors.
The third is the computational price of locality. Comfortable work with multimodal models usually requires a serious GPU (often 24GB VRAM+), and even then, you pay with time: context from images is heavy, and repeated passes over the screen eat up bandwidth.
When I design AI architecture for desktop automation, I almost always try to move away from “pixels” to structure: Accessibility tree, UI Automation, DOM (in web controls), and sometimes application APIs if they exist. In such representations, an element is a role, name, state, and hierarchy, not an area on the screen.
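To make “element as role, name, state, and hierarchy” concrete, here is a minimal stdlib-only sketch of such a representation with a semantic lookup. The `UINode` structure and the `InvoiceApproval` example tree are my own illustrative constructs, not any real accessibility API.

```python
from dataclasses import dataclass, field

@dataclass
class UINode:
    """A UI element as structure, not pixels: role, name, state, children."""
    role: str                        # e.g. "window", "button", "edit"
    name: str                        # accessible name, e.g. "Submit"
    state: frozenset = frozenset()   # e.g. {"enabled", "focusable"}
    children: list = field(default_factory=list)

def find(node: UINode, role: str, name: str):
    """Depth-first search by semantic anchor; no coordinates involved."""
    if node.role == role and node.name == name:
        return node
    for child in node.children:
        if (hit := find(child, role, name)) is not None:
            return hit
    return None

# A toy tree standing in for what an accessibility API would expose.
window = UINode("window", "InvoiceApproval", children=[
    UINode("edit", "Amount", frozenset({"enabled"})),
    UINode("button", "Submit", frozenset({"enabled", "focusable"})),
])

btn = find(window, "button", "Submit")
```

The lookup survives any layout shift, DPI change, or font swap, because nothing in it depends on where the button is drawn.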
Impact on Business and Automation
If you build “AI automation” on a vision grid, you are buying yourself two expense items: expensive runtime and expensive support. Runtime is expensive due to multimodal inference and frequent observation iterations. Support is expensive because any interface update turns into a regression that cannot be stably covered by a “selector”—only by a new observation cycle.
I have seen companies eventually limit the agent to short scenarios: open an app, click 2–3 buttons, fill out a form. This works, but only until you need an “operator”: reading long lists, comparing strings, scrolling tables, gathering data from multiple windows.
Who wins from structural access? Those whose processes rely on repeatable actions: back-office, procurement, logistics, accounting reconciliations, request processing, quality control. There, semantic control (via accessibility/DOM) provides predictability and speed, and the model is used for decision-making, not for guessing pixels.
Who loses? Teams trying to “do AI automation” without an integration layer, hoping the LLM will “see and figure it out” on its own. In our projects at Nahornyi AI Lab, I establish a separate layer of tools: UI structure extraction, normalization into a single format, safe actions, and only then—agent planning.
In the end, “AI adoption” becomes an engineering task: not choosing the smartest model, but assembling a control loop where the model receives stable primitives (find, focus, read, set, scrollTo, queryTable) and doesn't waste tokens on visual noise.
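The primitive set named above can be pinned down as an interface. The signatures below are my illustrative assumptions (the article names only the primitives, not their shapes), and the `FakeToolset` is an in-memory stand-in so the loop can be exercised without a real desktop backend.

```python
from typing import Any, Protocol

class UIToolset(Protocol):
    """Stable primitives the agent plans over. Signatures are illustrative,
    not a standard API; element ids are opaque handles."""
    def find(self, role: str, name: str) -> str: ...
    def focus(self, element_id: str) -> None: ...
    def read(self, element_id: str) -> str: ...
    def set(self, element_id: str, value: str) -> None: ...
    def scrollTo(self, element_id: str) -> None: ...
    def queryTable(self, element_id: str) -> list[dict[str, Any]]: ...

class FakeToolset:
    """Minimal in-memory implementation for offline testing of agent plans."""
    def __init__(self) -> None:
        self._values = {"edit:Amount": "0.00"}
    def find(self, role: str, name: str) -> str:
        return f"{role}:{name}"          # handle derived from the semantic anchor
    def focus(self, element_id: str) -> None:
        pass
    def read(self, element_id: str) -> str:
        return self._values.get(element_id, "")
    def set(self, element_id: str, value: str) -> None:
        self._values[element_id] = value
    def scrollTo(self, element_id: str) -> None:
        pass
    def queryTable(self, element_id: str) -> list[dict[str, Any]]:
        return []

# The model emits calls against these primitives instead of pixels:
tools: UIToolset = FakeToolset()
amount = tools.find("edit", "Amount")
tools.set(amount, "129.90")
```

The payoff of the `Protocol` is that the planner is written once against the interface, while real backends (UI Automation, DOM, app APIs) and test fakes are swapped underneath it.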
Strategic Vision and Deeper Dive
My forecast is simple: local desktop agents will become mass-market not when an even more “seeing” model is released, but when standardized agent primitives over UI structure appear. Vision will remain a backup sensor: for apps without accessibility, for VDI/remote desktops, for non-standard canvases.
I am already embedding a hybrid pattern into the architecture of AI solutions for business: “structure by default, vision by necessity.” In practice, it looks like this: the agent first works via accessibility/DOM, and only if an element is not found or content is rendered as an image does the visual mode with OCR and verification kick in.
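The fallback logic itself fits in a few lines. In this sketch both locators are placeholder callables I made up for illustration (a real structural locator would query accessibility/DOM, a real visual one would run screenshot + OCR); the point is the ordering and the provenance tag.

```python
# "Structure by default, vision by necessity": try the cheap, stable structural
# path first; invoke the expensive visual backup only when structure misses.

def resolve_element(query, structural_find, visual_find):
    """Return (hit, source) so every resolution records which sensor found it."""
    hit = structural_find(query)      # accessibility/DOM lookup (cheap, stable)
    if hit is not None:
        return hit, "structure"
    hit = visual_find(query)          # screenshot + OCR backup (slow, fragile)
    if hit is not None:
        return hit, "vision"
    raise LookupError(f"element not found: {query}")

# Stub locators: "Submit" is in the accessibility tree; "Export chart" is
# rendered on a canvas and only the visual path can locate it.
structural = {"Submit": "acc-id-42"}.get
visual = {"Export chart": (812, 440)}.get

print(resolve_element("Submit", structural, visual))        # ('acc-id-42', 'structure')
print(resolve_element("Export chart", structural, visual))  # ((812, 440), 'vision')
```

Tagging each hit with its source also gives you a free health metric: if the share of "vision" resolutions climbs, the structural layer is eroding and needs attention.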
There is one more nuance many miss: security and audit. A coordinate click is hard to explain and reproduce. But the action “click Submit button in InvoiceApproval window by accessibility-id” is easily logged, reviewed, and passes compliance. For the real sector, this is often the decisive argument in favor of “AI integration” via structural interfaces.
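What such a reviewable log entry might look like, sketched with the stdlib (the field names and the `accessibility-id:SubmitBtn` anchor are hypothetical, chosen to mirror the example above):

```python
import datetime
import json

# A semantic action is self-describing: it can be logged, diffed, reviewed,
# and replayed. A raw (x, y) click carries none of this meaning.

def log_action(action: str, role: str, name: str, window: str, anchor: str) -> str:
    """Serialize one agent action as a structured, auditable JSON record."""
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "action": action,
        "target": {
            "role": role,
            "name": name,
            "window": window,
            "anchor": anchor,   # stable semantic anchor, not screen coordinates
        },
    }
    return json.dumps(record)

entry = log_action("click", "button", "Submit", "InvoiceApproval",
                   "accessibility-id:SubmitBtn")
```

An auditor reading this record knows exactly what the agent did and to what, without replaying a single screenshot.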
If you are choosing a direction now, I would not invest months in a pure vision agent to “scroll and read.” I would invest in a layer of access to the UI tree, in a proper toolset, and in quality control of agent actions. This way, you get speed, stability, and manageable total cost of ownership.
This analysis was prepared by Vadim Nahornyi—Lead Expert at Nahornyi AI Lab on AI automation and AI implementation architecture in the real sector. If you are planning a local desktop agent or want to replace fragile vision scenarios with structural integration, I invite you to discuss the task: I will analyze your process, propose a target AI architecture, and an implementation plan with cost and risk forecasts.