Technical Context
I appreciate cases like this not for the wow factor, but for the down-to-earth engineering. Someone has actually built a multi-agent system in Claude Code: there's a main orchestrator, project agents, direct communication with each, and even the ability to create and delete agents on the fly.
For AI automation, this is no longer a toy. It's almost a living operating system for tasks like research, correspondence, accounting, drafting emails, and project management, all within a unified memory and logging loop.
What caught my attention is that this was done mainly through a Claude Code subscription, limited to about 10 iterations, not via an API. Validation is sometimes handled by Codex, with Claude Code and Gemini also acting as sub-agents. The setup works, but it is bound to hit the subscription's usage limits sooner or later, and anyone who tries to extract a pseudo-API from a subscription is skirting the TOS.
But something else is more important. A very relevant question about quality metrics came up in the discussion, and this is where the magic quickly fades. Having 20, 30, or 50 agents talking to each other doesn't automatically mean you have a good artificial intelligence implementation. Without stopping criteria, token budgets, and clear quality gates, you're just burning context beautifully and enthusiastically.
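What "stopping criteria, token budgets, and quality gates" could look like in code is a small budget object that every agent run charges against. This is a minimal sketch under my own assumptions; the names (`RunBudget`, `charge`, `should_stop`) and the thresholds are illustrative, not part of the original setup:

```python
from dataclasses import dataclass


@dataclass
class RunBudget:
    """Hard limits for one agent run: tokens, iterations, and a quality gate."""
    max_tokens: int = 100_000
    max_iterations: int = 10
    min_quality: float = 0.7   # domain-specific quality-gate threshold

    tokens_used: int = 0
    iterations: int = 0

    def charge(self, tokens: int) -> None:
        """Record one iteration's token spend."""
        self.tokens_used += tokens
        self.iterations += 1

    def should_stop(self, quality: float) -> bool:
        """Stop when the budget is exhausted or the quality gate is passed."""
        return (
            self.tokens_used >= self.max_tokens
            or self.iterations >= self.max_iterations
            or quality >= self.min_quality
        )


budget = RunBudget(max_tokens=50_000, max_iterations=5)
budget.charge(12_000)
print(budget.should_stop(quality=0.4))  # budget remains, gate not passed
```

The point is not the specific numbers but that the check is explicit: without an object like this, "keep iterating" is the default, and the default burns context.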
A telling snippet from the benchmarks: the decision-making skill burned 1.5× the tokens of a skill-free agent and produced a worse result, while the architecture-review skill delivered roughly double the quality for the same 1.5× token cost. I'd translate this as: not every skill improves the system; some just add ceremony and noise.
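One way to make that comparison concrete is quality gained per unit of token spend, relative to the no-skill baseline. The 0.9× figure for the decision skill is my own placeholder, since the benchmark snippet only says "worse":

```python
def quality_per_token(quality_ratio: float, token_ratio: float) -> float:
    """Quality per token spend, relative to a baseline of 1.0 on both axes."""
    return quality_ratio / token_ratio


# Illustrative figures: "worse" quality assumed as ~0.9x for the decision skill.
decision_skill = quality_per_token(0.9, 1.5)  # below 1.0: the skill hurts
arch_review = quality_per_token(2.0, 1.5)     # above 1.0: the skill pays off
```

Any skill whose ratio lands below 1.0 is worse than running without it, no matter how impressive its transcript looks.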
And yes, this aligns perfectly with what I see in practice. If an agent is poor at breaking down tasks, setting priorities, and knowing when to stop, multi-agency starts eating the budget linearly. However, a review layer before execution often pays for itself very quickly because it's cheaper to catch a bad plan than to clean up bad code or faulty automation later.
Impact on Business and Automation
Teams with many parallel cognitive tasks win: research, reviews, communications, and project support. In these cases, AI integration with an orchestrator truly saves hours and reduces manual context switching.
Those who think more agents automatically mean better results lose out. For routine tasks, a single, well-tuned agent is almost always cheaper and more stable than a 'village' of bots.
I would establish three rules: an iteration limit, kill criteria for stuck agents, and a separate architecture-review before execution. These are precisely the kinds of bottlenecks we address for clients at Nahornyi AI Lab when building AI solutions for business without the fireworks of wasted tokens.
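The three rules above can be sketched as a single guardrail loop. `agent_step` and `review_plan` are hypothetical stand-ins for whatever drives the agent and whatever performs the architecture review; the structure, not the names, is the point:

```python
import time


def run_agent_with_guardrails(agent_step, review_plan,
                              max_iters=10, stall_timeout_s=120):
    """Illustrative guardrail loop (hypothetical helpers):
    1. iteration limit      -> max_iters
    2. kill stuck agents    -> stall_timeout_s without progress
    3. review before exec   -> review_plan() must approve the plan first
    """
    plan = agent_step("plan")
    verdict = review_plan(plan)          # rule 3: review before execution
    if not verdict["approved"]:
        return {"status": "rejected", "reason": verdict["reason"]}

    last_progress = time.monotonic()
    for i in range(max_iters):           # rule 1: hard iteration cap
        result = agent_step("execute")
        if result.get("progress"):
            last_progress = time.monotonic()
        elif time.monotonic() - last_progress > stall_timeout_s:
            return {"status": "killed", "reason": "stalled"}   # rule 2
        if result.get("done"):
            return {"status": "done", "iterations": i + 1}
    return {"status": "stopped", "reason": "iteration limit"}
```

Note the ordering: the cheapest gate (plan review) runs first, so a bad plan never reaches the expensive execution loop at all.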
If you're already facing a chaos of chats, tasks, and manual checks, we can build proper AI automation for your process without this menagerie. Get in touch, and Vadym Nahornyi and I at Nahornyi AI Lab will assess where you need one strong agent versus where it actually makes sense to build an orchestrator.