
DeepSeek Shifts Reasoning Toward the Visual Domain

DeepSeek introduced Thinking with Visual Primitives, an approach in which models reason with visual points and boxes rather than generating long textual chains of thought. For businesses running AI automation, this matters because it could cut multimodal pipeline costs and make spatial reasoning more reliable.

Technical Context

I genuinely appreciate concepts like this: not just another screen-filling CoT, but an attempt to alter the core mechanics of reasoning itself. In "Thinking with Visual Primitives," the model uses points and bounding boxes as atomic units of thought, literally "pointing" at objects during inference. For AI implementation, this is much more intriguing than merely pouring extra tokens into textual deliberation.

The core issue is that standard textual reasoning suffers from a frustrating "Reference Gap." By the time a model verbally explains exactly which tiny object to the left of the red block it means, it loses precision. Here, the process is tied directly to coordinates, making intermediate steps shorter and much clearer for the model itself.
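To make the Reference Gap concrete, here is a minimal sketch of what "tied directly to coordinates" could look like on the consuming side. The `Box` class and the `left_of` and `smallest` helpers are hypothetical names of mine, not anything taken from DeepSeek's report; the point is only that "the tiny object to the left of the red block" becomes a comparison on numbers instead of a sentence.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding-box primitive in pixel coordinates."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def area(self) -> float:
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


def left_of(candidates: list[Box], anchor: Box) -> list[Box]:
    """Keep boxes whose right edge does not pass the anchor's left edge."""
    return [b for b in candidates if b.x_max <= anchor.x_min]


def smallest(boxes: list[Box]) -> Box:
    """Pick the box with the smallest area."""
    return min(boxes, key=lambda b: b.area)


red_block = Box("red block", 100, 40, 160, 100)
objects = [
    Box("screw", 20, 60, 30, 70),
    Box("panel", 0, 0, 90, 90),
    Box("nut", 200, 60, 210, 70),
]

# "The tiny object to the left of the red block" as an explicit lookup.
target = smallest(left_of(objects, red_block))
print(target.label)
```

With these sample values the lookup resolves to the screw, and every intermediate step is a box that can be drawn on the frame.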

I specifically noted two things. First: visual tracking is built natively into the reasoning trace, not bolted on as an afterthought. Second: the documentation mentions a KV-cache compression scheme where every 4 visual tokens are compressed into a single entry, which looks like a highly practical move for lengthy multimodal runs.
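A quick sketch of what a 4-to-1 grouping means for cache size. The only thing taken from the documentation is the ratio; the mean pooling used in `mean_pool` is my own placeholder for whatever compression operator the model actually applies, and both function names are hypothetical.

```python
from __future__ import annotations

from math import ceil

GROUP_SIZE = 4  # ratio mentioned in the documentation: 4 visual tokens per entry


def compressed_entries(num_visual_tokens: int, group_size: int = GROUP_SIZE) -> int:
    """Number of cache entries left after grouping visual tokens."""
    return ceil(num_visual_tokens / group_size)


def mean_pool(vectors: list[list[float]], group_size: int = GROUP_SIZE) -> list[list[float]]:
    """Average each consecutive group of vectors into one entry.

    Mean pooling is an assumption for illustration only; the real scheme
    may use a learned projection or something else entirely.
    """
    pooled = []
    for start in range(0, len(vectors), group_size):
        group = vectors[start:start + group_size]
        dim = len(group[0])
        pooled.append([sum(v[i] for v in group) / len(group) for i in range(dim)])
    return pooled


print(compressed_entries(1024))  # 1024 visual tokens -> 256 entries
```

Under this assumption, an image that costs 1024 visual tokens would occupy 256 cache entries, which is where the savings on lengthy multimodal runs would come from.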

According to project claims, this approach yields strong results in counting and spatial reasoning while consuming a smaller image-token budget. I wouldn't jump to conclusions just yet, though: the repository has already been taken down, meaning we have to rely on the technical report and independent reproductions rather than glossy charts. The direction itself, however, looks very promising, especially given the growing fatigue with textual reasoning spanning hundreds of thousands of tokens.

What This Means for Automation

The first benefit is obvious: cheaper inference in scenarios where the model needs to see accurately rather than chat endlessly. Photo inspection, object counting, visual auditing, and processing schematics or warehouse footage fit this profile almost perfectly.

The second point is architectural. If reasoning is anchored to coordinates, AI integration into business processes becomes much cleaner. It is easier to debug errors, clearer to see exactly which part of a frame the model stumbled upon, and simpler to build human-in-the-loop systems.
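As an illustration of the debugging and human-in-the-loop point, here is a rough sketch of a review filter over a coordinate-anchored trace. Everything in it is hypothetical: the `TraceStep` structure, the per-step `confidence` field, and the 0.6 threshold are my assumptions about what a pipeline might log, not a format DeepSeek has published.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TraceStep:
    """Hypothetical reasoning step: a label, a box, and a confidence score."""
    label: str
    box: tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    confidence: float


def needs_review(step: TraceStep, frame_w: int, frame_h: int, min_conf: float = 0.6) -> bool:
    """Flag steps whose box leaves the frame or whose confidence is low."""
    x0, y0, x1, y1 = step.box
    inside = 0 <= x0 < x1 <= frame_w and 0 <= y0 < y1 <= frame_h
    return (not inside) or step.confidence < min_conf


def review_queue(trace: list[TraceStep], frame_w: int, frame_h: int,
                 min_conf: float = 0.6) -> list[TraceStep]:
    """Return the steps a human operator should look at."""
    return [s for s in trace if needs_review(s, frame_w, frame_h, min_conf)]


trace = [
    TraceStep("pallet", (10, 10, 50, 50), 0.9),
    TraceStep("box", (600, 10, 700, 50), 0.9),   # extends past a 640 px wide frame
    TraceStep("label", (20, 20, 30, 30), 0.3),   # low confidence
]

for step in review_queue(trace, frame_w=640, frame_h=480):
    print(step.label, step.box)
```

Because each flagged step carries its own box, the operator sees the exact region of the frame that caused the problem instead of a paragraph of text to interpret.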

Who wins? Teams dealing with massive image volumes and expensive multimodal inference. Who loses? Those hoping to solve all spatial tasks with a single massive LLM lacking proper visual logic.

I wouldn't call this a revolution just yet, but rather a very solid shift in the right direction. And yes, this is exactly where the demo ends and real AI solutions architecture begins: you have to assemble a pipeline, test its resilience, and calculate the cost of failure. If your product requires a model to genuinely "look and understand" instead of simulating comprehension via text, let's analyze it with your data. At Nahornyi AI Lab, we build AI automation precisely where a single inaccurate visual reference could otherwise turn into a costly operational nightmare.

We previously analyzed the mechanics of extended reasoning and the associated context costs using Claude Opus 4.6 as an example. Understanding these limitations clearly explains why the industry is so actively seeking a replacement for long and resource-intensive text chains.
