
Why VLMs Fail on Number Plates and How to Fix It

In practice, VLMs struggle to read small numbers accurately, often confusing similar characters like M/N or 6/9 and failing to maintain a consistent output format. This isn't a dead end for AI automation. The solution is a multi-step pipeline: detect and crop the number plate area, identify its format, and then read character groups separately.

Technical Context

I prefer cases like this over flashy demos. The discussion highlighted something I regularly see in real-world AI implementation: a small VLM like E4B seems "powerful" yet consistently confuses similar characters on license plates—M and N, 6 and 9—and sometimes fails to keep the requested response structure.

And this isn’t surprising. If the input image is scaled down significantly, the model physically loses small details. For a number plate, this is fatal: one stroke disappears, and the letter changes completely.
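To make the scale problem concrete, here is a back-of-the-envelope calculation. The numbers (frame width, plate fraction, model input size) are illustrative assumptions, not measurements from any specific model:

```python
# Illustrative arithmetic: how much detail survives a VLM's input resize.
# All numbers below are assumptions for the sake of the example.

def char_pixels(frame_w, plate_frac, n_chars, model_input_w):
    """Approximate pixels per character after the model resizes the frame."""
    scale = model_input_w / frame_w            # uniform downscale factor
    plate_px = frame_w * plate_frac * scale    # plate width in model pixels
    return plate_px / n_chars                  # pixels available per character

# A 1920-px frame where the plate spans ~8% of the width, 7 characters,
# fed to a model with a 448-px input side:
px = char_pixels(1920, 0.08, 7, 448)
print(round(px, 1))  # ≈ 5.1 px per character: far too few to tell M from N
```

At roughly five pixels per character, the strokes that distinguish M from N or 6 from 9 simply no longer exist in the input, which is exactly why cropping before reading matters.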

What I liked here wasn't the complaint but the engineering mindset from the thread. Instead of trying to force a perfect OCR result from the model with a single prompt, the idea was to build a pipeline: first, find the number plate area, then crop it, then determine the country and format, and only then read the characters in sections rather than all at once.

This is exactly how I would approach it. First, a bounding box or at least a rough localization. Then, a separate pass for a template like AA 1234 or AB 12 CD. Finally, sequential reading of groups, where the model doesn't diffuse its attention across the entire image.
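The steps above can be sketched as a small pipeline. The function names (`detect_plate`, `classify_format`, `read_group`) are hypothetical placeholders for real model calls, stubbed here with fixed results so the control flow is visible:

```python
# Sketch of the multi-step pipeline: localize -> crop -> classify format ->
# read character groups one at a time. All model calls are stubbed.
from dataclasses import dataclass

@dataclass
class Box:
    x: int
    y: int
    w: int
    h: int

def detect_plate(frame) -> Box:
    # Step 1: localize the plate (a detector or a coarse first VLM pass).
    return Box(620, 410, 180, 48)           # stubbed result

def classify_format(plate_crop) -> str:
    # Step 2: decide the template before reading, e.g. "AA 1234" vs "AB 12 CD".
    return "AA 1234"                        # stubbed result

def read_group(plate_crop, group_spec) -> str:
    # Step 3: one narrow read per character group, never the whole plate at once.
    return {"letters:2": "KA", "digits:4": "7215"}[group_spec]  # stubbed

def read_plate(frame):
    box = detect_plate(frame)
    crop = frame  # in a real system: crop `frame` to `box` with some margin
    fmt = classify_format(crop)
    groups = ["letters:2", "digits:4"] if fmt == "AA 1234" else []
    return " ".join(read_group(crop, g) for g in groups)

print(read_plate(frame=None))  # → "KA 7215"
```

The point of the structure is that each function has one narrow job, so each can be validated, retried, or swapped independently.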

Another key point: if a model struggles to follow the output format, don't argue with it in a single request. I usually break the task into steps and force each step to return a very specific, narrow JSON. This isn't magic; it's just proper AI integration instead of hoping "it will get it right this time."
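One way to enforce "a very specific, narrow JSON" per step is to validate each response before moving on. The schema below (a single `"group"` key matching a short alphanumeric pattern) is my own assumption for illustration:

```python
# Validate one step's output: accept only {"group": "<1-4 uppercase
# alphanumerics>"}; anything else returns None and should trigger a retry.
import json
import re

GROUP_RE = re.compile(r"^[A-Z0-9]{1,4}$")

def parse_group_response(raw: str):
    """Return the group string if the reply is valid narrow JSON, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    value = obj.get("group") if isinstance(obj, dict) else None
    if isinstance(value, str) and GROUP_RE.fullmatch(value):
        return value
    return None

print(parse_group_response('{"group": "AB12"}'))        # → AB12
print(parse_group_response('Sure! The group is AB12.'))  # → None
```

Rejecting a malformed reply and re-asking one small question is far cheaper than arguing with the model about a whole-plate answer in one prompt.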

Cheap fine-tuning also sounds logical here, especially if you have many similar plates, cameras, and countries. But I wouldn't start with it. Until a clear multi-step process is in place, fine-tuning often just masks an architectural problem.

Impact on Business and Automation

For production, the takeaway is simple: a single VLM call on an entire frame does not equal reliable OCR. If an error affects a barrier, a fine, parking, or logistics, you need a pipeline-first approach, not a belief that a "universal multimodal model will do it all."

The winning teams are those who can break down a task into stages and measure confidence at each step. The losing teams are those who build a critical process on a single, raw model response.
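Measuring confidence at each step can be as simple as a threshold gate with a fallback pass. The threshold and the second-pass strategy here are illustrative assumptions, not a production recipe:

```python
# Minimal sketch of per-step confidence gating: take the primary reader's
# answer only if it is confident enough, otherwise run a fallback pass
# (e.g. a tighter re-crop and a second read).
def gated_read(primary, fallback, threshold=0.85):
    """Each reader returns (text, confidence); pick the first confident one."""
    text, conf = primary()
    if conf >= threshold:
        return text, "primary"
    text, _ = fallback()
    return text, "fallback"

# Stubbed readers standing in for real model calls:
low_conf  = lambda: ("KA 7216", 0.61)   # first pass, unsure about 6 vs 9
high_conf = lambda: ("KA 7215", 0.93)   # fallback on a tighter crop

print(gated_read(low_conf, high_conf))  # → ('KA 7215', 'fallback')
```

The same gate works at every stage: detection, format classification, and each group read, which is what turns a single raw model response into a measurable process.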

I see this as AI solutions architecture, not just picking the next trendy model. At Nahornyi AI Lab, this is precisely what we build for clients: determining where a crop is needed, where format validation is required, where a fallback to a second pass is necessary, and where it truly makes sense to build AI automation around a VLM so that it saves time instead of creating manual verification on top of manual verification.

If you have a similar story with documents, numbers, or small text in photos, we can quickly review your pipeline and find where the model is losing the signal. Usually, the problem isn't a "bad AI" but that it was given too large a chunk of a task. This is exactly the kind of situation where Nahornyi AI Lab can build a calm, working system instead of another beautiful but fragile demo.

A related discussion on visual AI models explored Seedance 2, a video model designed for AI video generation. Understanding the production realities and business value of such visual models is crucial when evaluating new Visual Language Model pipelines.
