Technical Context
I like studies like this one because they are grounded: not an abstract benchmark, but 90 independent runs of the same task. Agents built a real-time retrospective board from a single specification, and each result was scored against 14 criteria with a 42-point ceiling, plus a visual review.
For me, the key takeaway isn't the UI but what it means for deploying AI in practice. The authors tested what actually improves first-attempt reliability: model tier, reasoning effort, access to testing tools, and design-oriented prompting.
The strongest signal: reasoning effort won decisively. Moving from High to xHigh reasoning effort increased the share of perfect first-run deliveries from 28% to 89%, while corrective prompts dropped about fivefold. That's not a cosmetic tweak; it's a regime change.
Here's the point where I think many teams should pause. Testing tools did not improve functional reliability, even where they seemingly should have caught issues, yet they raised costs by 42–68%.
Model tier also proved to be a dominant factor. Frontier models operated near the ceiling, while a weaker local model significantly underperformed. Design-oriented prompting improved the visual aspect but not functionality, which mirrors real life: prettier doesn't mean more reliable.
What This Changes for Business and Automation
When I design an AI architecture for a coding agent, I'm now even more cautious about the idea of "let's throw in more tools and it will become more reliable." No: you first need to pay for the model's thinking capacity, and only then wrap it with tools.
The second practical insight: a cheap agent with a plethora of checks may end up costlier and weaker than a stronger model with a high reasoning budget. For AI automation, this is uncomfortable but useful math.
The teams that win are those that count the cost of a successful first run, not just the token price. The ones that lose confuse orchestration complexity with output quality.
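To make "cost of a successful first run" concrete, here is a minimal back-of-the-envelope sketch. The 28% and 89% perfect-first-run rates and the 42–68% tool overhead come from the study; everything else is my assumption for illustration: the baseline cost of 1.00 per run is a hypothetical unit, the function name is made up, and the model treats every non-perfect run as a full rerun (the study itself counted corrective prompts, so real numbers will differ).

```python
def cost_per_perfect_first_run(cost_per_run: float, perfect_rate: float) -> float:
    """Expected spend per perfect delivery, assuming independent reruns.

    With success probability p per run, the expected number of runs
    until the first success is 1 / p, so expected cost is cost / p.
    """
    if not 0 < perfect_rate <= 1:
        raise ValueError("perfect_rate must be in (0, 1]")
    return cost_per_run / perfect_rate


# Figures reported in the study.
HIGH_RATE = 0.28             # perfect first-run share at High reasoning effort
XHIGH_RATE = 0.89            # perfect first-run share at xHigh reasoning effort
TOOL_OVERHEAD = (0.42, 0.68) # extra cost from testing tools, low and high end

# Hypothetical unit cost of one run at High effort without tools.
BASE_COST = 1.00

# How many times more expensive an xHigh run may be before High
# becomes the cheaper route to a perfect delivery.
break_even_multiplier = XHIGH_RATE / HIGH_RATE  # about 3.18

# Tools raised cost without raising functional reliability, so the cost
# per perfect delivery grows by the same factor as the cost per run.
baseline = cost_per_perfect_first_run(BASE_COST, HIGH_RATE)
with_tools = [
    cost_per_perfect_first_run(BASE_COST * (1 + overhead), HIGH_RATE)
    for overhead in TOOL_OVERHEAD
]

if __name__ == "__main__":
    print(f"Break-even price multiplier for xHigh: {break_even_multiplier:.2f}x")
    print(f"Baseline cost per perfect run: {baseline:.2f}")
    for overhead, cost in zip(TOOL_OVERHEAD, with_tools):
        print(f"With tools (+{overhead:.0%}): {cost:.2f}")
```

Under these assumptions, higher reasoning effort pays for itself as long as an xHigh run costs less than roughly three times a High run, while the tool overhead passes straight through to the cost of every successful delivery.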
At Nahornyi AI Lab, we solve exactly these puzzles in practice: where you need strong reasoning, where a simple pipeline is sufficient, and where tools only inflate the bill. If your AI integration in development is already consuming budget but delivering unpredictable results, let's discuss your scenario, and I'll propose an AI solution without unnecessary agentic magic.