Technical Context
I love things like this not for the hype, but for how grounded they are: not a demo with perfect audio, but a real call to a Spanish restaurant. And this is where AI automation starts to look less like a toy and more like a proper foundation for business telephony.
In the published PoC, the agent called, spoke in Spanish, and booked a table for three at 8:00 PM. The best part isn't that it "said something," but that it made it all the way to a booking confirmation, even though the STT stumbled and interpreted some noise as a bizarre phrase like "¿Qué tipo de escándalos hay acá?" (What kind of scandals are there here?).
The tech stack is also telling. They used ElevenLabs (11labs) for TTS/STT, Zadarma for telephony, and Gemini 1.5 Flash as the brain. The call cost about 15-20 cents for a minute and a half, which is the level where I start seeing it not as an experiment, but as a candidate for integration into day-to-day operations.
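To put that price in perspective, here is a quick back-of-the-envelope sketch. The only inputs taken from the PoC are the 15-20 cents and the 90 seconds; the volume of 100 calls a day is my own hypothetical number, used purely for illustration.

```python
def cost_per_minute(total_cents: float, duration_seconds: float) -> float:
    """Average cost in cents per minute of call time."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return total_cents / (duration_seconds / 60.0)


def monthly_cost_usd(calls_per_day: int, cents_per_call: float, days: int = 30) -> float:
    """Rough monthly bill in USD, assuming every call costs the same."""
    return calls_per_day * cents_per_call * days / 100.0


# Figures from the PoC: 15-20 cents for a 90-second call.
low = cost_per_minute(15, 90)    # 10.0 cents per minute
high = cost_per_minute(20, 90)   # about 13.3 cents per minute

# Hypothetical volume (not from the source): 100 calls per day.
monthly_low = monthly_cost_usd(100, 15)   # 450.0 USD
monthly_high = monthly_cost_usd(100, 20)  # 600.0 USD

if __name__ == "__main__":
    print(f"Per minute: {low:.1f}-{high:.1f} cents")
    print(f"Per month at 100 calls/day: ${monthly_low:.0f}-${monthly_high:.0f}")
```

So the PoC lands at roughly 10 to 13 cents per minute. Whether that is cheap depends entirely on what a missed or manually handled call costs you.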
I especially appreciated a small detail that usually wastes half a day of debugging: the API field is called message, not text. Anyone who has built voice pipelines by hand knows how much time this kind of thing can kill, especially when everything else looks "almost right."
Later, the author showed a second call, this time to the Pozalagua cave. The observation there was even more interesting: a short Hola! at the start works better than a longer introduction. This is very true to life. In voice agents, the first 2-3 seconds often decide everything: whether the person understands what's happening or just hangs up.
The author's next step, according to his notes, is a fully local ASR/TTS stack. And I get it. As soon as you step out of the sandbox, you immediately face issues with latency, privacy, cost at scale, and control over quality for a specific language and accent.
Impact on Business and Automation
Looking at this not as an enthusiast but as a business owner, the signal is simple: phone-based scenarios are starting to be truly automated. Booking, appointment confirmations, rescheduling, answering typical questions, collecting basic client data: all of this can now be assembled into a working AI solution, not just a presentation for investors.
But I wouldn't jump to the false conclusion that the main problem is now solved. In my opinion, the biggest pain point here isn't TTS or even the LLM. The main landmine, as noted in the discussion, is turn detection: when to speak, when to stay silent, when not to interrupt, and when a pause means it's the agent's turn to speak again.
It's turn detection that makes the difference between "wow, it called by itself" and "oh god, please turn it off." A dialogue might be smart on paper, but if the agent talks over the other person or freezes after a clear answer, the user experience falls apart in seconds.
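To make the problem concrete, here is a minimal sketch of the simplest possible approach: silence-based endpointing. This is not what the PoC author used (his notes don't say), and production systems often go further with semantic or model-based turn detection. The class name and all thresholds below are my own assumptions, and the input is assumed to be a stream of per-frame speech/no-speech decisions from some voice activity detector.

```python
from dataclasses import dataclass, field


@dataclass
class SilenceEndpointer:
    """Decides that the caller's turn has ended after enough trailing silence."""
    frame_ms: int = 20          # duration of one VAD frame (assumed)
    min_speech_ms: int = 200    # shorter bursts are treated as noise (assumed)
    end_silence_ms: int = 700   # silence needed to hand the turn to the agent (assumed)
    _speech_ms: int = field(default=0, init=False, repr=False)
    _silence_ms: int = field(default=0, init=False, repr=False)

    def push(self, is_speech: bool) -> bool:
        """Feed one VAD decision; return True when the caller's turn has ended."""
        if is_speech:
            self._speech_ms += self.frame_ms
            self._silence_ms = 0  # any speech cancels the pending end-of-turn
            return False
        if self._speech_ms < self.min_speech_ms:
            # Too little speech so far: discard it as noise and keep waiting.
            self._speech_ms = 0
            return False
        self._silence_ms += self.frame_ms
        if self._silence_ms >= self.end_silence_ms:
            self._speech_ms = 0
            self._silence_ms = 0
            return True
        return False


def first_turn_end(frames):
    """Index of the frame at which the first turn ends, or None."""
    endpointer = SilenceEndpointer()
    for index, is_speech in enumerate(frames):
        if endpointer.push(is_speech):
            return index
    return None
```

The trade-off sits in `end_silence_ms`: set it too low and the agent cuts people off mid-thought, set it too high and it "freezes" after a clear answer. A fixed timeout cannot get both right for every speaker, which is exactly why this part is a landmine.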
Who benefits first from such systems? Restaurants, clinics, salons, local services, tourism: anyone whose inbound flow still lives on the phone. Who loses? Those who think they can just hook up a model to a SIP trunk and get a ready-made employee without configuring scenarios, timings, fallback logic, and monitoring.
In cases like these, I always look at the entire architecture: telephony, recognition, dialogue flow management, session memory, human escalation rules, error logging, and cost per minute. At Nahornyi AI Lab, we solve these kinds of AI implementation challenges for clients at the intersection of business and engineering, where what matters isn't just "we have an agent," but that it actually reduces the team's workload.
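As an illustration of what "human escalation rules" and "error logging" can look like at the smallest scale, here is a sketch of a per-call session object. Everything in it is hypothetical: the class, the thresholds, and the assumption that your STT returns a confidence score per transcript (not every provider does).

```python
from dataclasses import dataclass, field


@dataclass
class CallSession:
    """Tracks one call and decides when to hand over to a human."""
    min_confidence: float = 0.6   # below this, the transcript is not trusted (assumed)
    max_low_confidence: int = 2   # consecutive bad transcripts before escalating (assumed)
    max_seconds: float = 180.0    # hard cap on call length (assumed)
    low_confidence_streak: int = 0
    log: list = field(default_factory=list)

    def on_transcript(self, text: str, confidence: float, elapsed_s: float) -> str:
        """Return the next action: 'continue', 'reprompt', or 'escalate'."""
        if elapsed_s > self.max_seconds:
            action = "escalate"
        elif confidence < self.min_confidence:
            self.low_confidence_streak += 1
            if self.low_confidence_streak >= self.max_low_confidence:
                action = "escalate"
            else:
                action = "reprompt"
        else:
            self.low_confidence_streak = 0
            action = "continue"
        # Keep every decision so failed calls can be reviewed later.
        self.log.append((elapsed_s, text, confidence, action))
        return action


if __name__ == "__main__":
    session = CallSession()
    print(session.on_transcript("Hola, buenas tardes", 0.92, 2.0))
    print(session.on_transcript("¿Qué tipo de escándalos hay acá?", 0.31, 10.0))
```

The point is not these particular rules but that they exist at all: a misheard phrase should lead to a polite reprompt or a handover, never to the agent confidently answering a question nobody asked.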
My conclusion is simple: voice agents have moved beyond the circus trick stage. The winners, though, won't be the systems with the "smartest" voices, but those with a carefully assembled AI architecture that accounts for STT errors and has a polished conversation rhythm. If your business is losing leads on calls or your team is spending hours on repetitive conversations, let's analyze your flow calmly and professionally. At Nahornyi AI Lab, I can help you build AI automation so that the agent doesn't annoy customers but actually handles routine tasks, freeing up your people for real work.