
OpenCV 5 Learns to Run LLMs and VLMs Inside Itself

OpenCV 5 now runs LLMs and VLMs directly inside its DNN module, using ONNX graphs, native tokenization, and a KV-cache. This matters for businesses because it simplifies and speeds up AI integration into local computer vision pipelines, reduces the need for external API calls, and makes on-device AI more practical and robust.

Technical Context

I dug into OpenCV 5 not out of idle curiosity but because changes like this affect day-to-day work immediately: AI integration and AI automation on the edge can now be assembled without stacking extra layers of separate runtimes and APIs. And this is where OpenCV really surprised me.

The main change isn't in a flashy press release but in the DNN engine. It was rebuilt around a typed operation graph with shape inference, constant folding, and fusion. As a result, ONNX operator coverage jumped from about 22% in the 4.x branch to over 80%, and this opens the door to modern transformer models with dynamic shapes.
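To make "constant folding" concrete, here is a toy sketch of the idea in plain Python. It is not OpenCV's implementation: the tuple-based graph format and the fold function are my own assumptions for illustration. The point is only that any subgraph whose inputs are all constants can be evaluated once at load time instead of on every inference call.

```python
import operator

# Hypothetical mini graph format (an assumption for this sketch):
#   ("const", value), ("input", name), or (op_name, left, right)
OPS = {"add": operator.add, "mul": operator.mul}

def fold(node):
    """Return an equivalent graph with all-constant subgraphs precomputed."""
    kind = node[0]
    if kind in ("const", "input"):
        return node  # leaves are already as simple as they can be
    left, right = fold(node[1]), fold(node[2])
    if left[0] == "const" and right[0] == "const":
        # Both operands are known ahead of time, so compute the result now.
        return ("const", OPS[kind](left[1], right[1]))
    # At least one operand depends on runtime input, so keep the operation.
    return (kind, left, right)

graph = ("add", ("input", "x"), ("mul", ("const", 3), ("const", 4)))
print(fold(graph))
```

Here the multiplication of 3 and 4 collapses into a single constant node, while the addition stays because it depends on the runtime input x.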

Then the most interesting part begins. OpenCV 5 can run LLMs and VLMs through the familiar Net API, not through a separate chat framework. The idea is roughly: load the model, feed the input, get the output. Only now the model is not just a detector or a segmentation network, but Qwen 2.5, Gemma 3, PaliGemma, and similar.
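For reference, the "load, feed, infer" flow looks roughly like this in the Python bindings. This is a minimal sketch, assuming the entry points known from the 4.x branch (cv2.dnn.readNetFromONNX, Net.setInput, Net.forward) carry over unchanged. The helper names load_net and run_once are hypothetical, and the tokenization and text-generation calls are not shown because the materials I have do not name them.

```python
def load_net(onnx_path):
    """Load an ONNX model into an OpenCV DNN Net."""
    import cv2  # imported lazily so this module loads even without OpenCV
    return cv2.dnn.readNetFromONNX(onnx_path)

def run_once(net, blob):
    """Classic Net flow: set the input tensor, then run one forward pass."""
    net.setInput(blob)
    return net.forward()
```

Usage would be along the lines of run_once(load_net("model.onnx"), blob), where the file name is a placeholder and blob is an input tensor already preprocessed the way the model expects.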

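For autoregressive generation, the team added native tokenization and a KV-cache, which stores the attention keys and values already computed for earlier tokens so that each new token does not recompute the whole prefix. Without this, any attempt to run an LLM inside a classic CV library would look like a weird demo trick, not a working path. Here you can already see that the team is aiming not for hype, but for a solid inference pipeline.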
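Why the KV-cache matters is easiest to see in a toy example. The sketch below is plain Python, not OpenCV code: the project function is a made-up stand-in for the learned key and value projections, and attention is single-head with the token itself used as the query. Both decoders produce the same outputs; the cached one just projects each token once.

```python
import math

calls = {"project": 0}  # counts how many key/value projections were computed

def project(x):
    """Toy stand-in (assumption) for one token's key and value projections."""
    calls["project"] += 1
    key = [2.0 * x[0], x[1]]
    value = [x[0] + x[1], x[0] - x[1]]
    return key, value

def attend(query, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def decode_recompute(tokens):
    """No cache: re-project the entire prefix at every step."""
    outputs = []
    for t in range(len(tokens)):
        pairs = [project(x) for x in tokens[: t + 1]]
        keys = [k for k, _ in pairs]
        values = [v for _, v in pairs]
        outputs.append(attend(tokens[t], keys, values))
    return outputs

def decode_cached(tokens):
    """KV-cache: project only the new token and append it to the cache."""
    keys, values, outputs = [], [], []
    for x in tokens:
        k, v = project(x)
        keys.append(k)
        values.append(v)
        outputs.append(attend(x, keys, values))
    return outputs
```

For a sequence of 4 tokens, the uncached version performs 1 + 2 + 3 + 4 = 10 projections and the cached version performs 4. The gap grows quadratically with sequence length, which is why generation without a cache stops being practical very quickly.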

But there's an important caveat I specifically noted: this is not a replacement for everything, and it is not a universal environment for agent systems. Based on current materials, you need to build with WITH_ONNXRUNTIME=ON, which means the dependency on ONNX Runtime is still there. It is simply embedded into a more unified OpenCV flow now, and for many scenarios that greatly simplifies the architecture.

What This Means for Business and Automation

I see three direct consequences. First: local vision pipelines gain contextual understanding of images without tapping external APIs. For private data, manufacturing, and healthcare, this is very attractive.

Second: AI solution development for cameras, terminals, robots, and embedded scenarios becomes simpler across the stack. Fewer dependencies, fewer failure points, faster maintenance.

Third: teams that already have OpenCV in production stand to benefit the most. Those who assume any LLM will magically run inside the library, without selecting the right ONNX model, building OpenCV accordingly, and testing on the target hardware, will lose out.

I constantly deal with these intersections: a model seems to run, then hits memory limits, latency problems, or incorrect preprocessing. If you're considering AI automation on top of video, documents, or visual inspection, feel free to bring it to Nahornyi AI Lab, and Vadym Nahornyi and I will design an AI architecture for your real process, not just a pretty slide.

We previously examined the Code Map UX pattern, which speeds up code navigation by precisely injecting AI context. This approach resonates with the new OpenCV 5 capabilities, where LLMs and VLMs are embedded directly into the computer vision engine.
