Technical Context
I dove into the gpt-realtime-2 specifications with a practical question: can this finally be used to build proper AI automation for calls, support, and voice assistants, not just another pretty prototype? The short answer is yes. This is the release where OpenAI has finally squeezed latency down to a level where dialogue no longer falls apart.
The model accepts text, audio, and images, and outputs text and voice. It connects via WebRTC, WebSocket, or SIP, meaning browser, server, and telephony are covered without any acrobatic workarounds. The context window is 32k tokens, the maximum response is 4096 tokens, and the knowledge cutoff is October 2023.
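To make the connection story concrete, here's a minimal server-side sketch in TypeScript using the `ws` package. The endpoint shape and event names follow OpenAI's Realtime API docs, but treat the exact model string and session fields as assumptions and check the current reference before copying this anywhere serious.

```typescript
import WebSocket from "ws";

// Server-side Realtime session over WebSocket. The model string is an
// assumption; swap in whatever the current docs list.
const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-realtime",
  { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } }
);

ws.on("open", () => {
  // Configure the session once the socket is up: voice plus text output,
  // and server-side voice activity detection for turn taking.
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      modalities: ["audio", "text"],
      instructions: "You are a concise phone support agent.",
      turn_detection: { type: "server_vad" },
    },
  }));
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  console.log("server event:", event.type);
});
```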
What really impressed me is that this isn't just STT plus LLM plus TTS glued together from three separate services. Here, the speech-to-speech stream runs in a single real-time loop with proper interruption handling. This is critical for a live conversation: when a person interjects, the model doesn't freeze and wait for the end of the phrase like an answering machine from 2014.
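On the wire, barge-in handling looks roughly like this, continuing the socket sketch above. The event names (`input_audio_buffer.speech_started`, `response.cancel`) are my reading of the Realtime API reference, so verify them; the playback stub is a placeholder for whatever audio sink you actually use.

```typescript
// Placeholder for your own audio pipeline: flush whatever sink you feed
// model output into so the user doesn't hear a stale half-sentence.
function stopLocalAudioPlayback(): void {
  /* ... */
}

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type === "input_audio_buffer.speech_started") {
    // The caller started talking over the model: kill playback at once
    // and cancel the in-flight response instead of letting it finish.
    stopLocalAudioPlayback();
    ws.send(JSON.stringify({ type: "response.cancel" }));
  }
});
```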
In terms of numbers, OpenAI claims a 48% improvement in instruction following and a 34% improvement in tool calling compared to the preview. For production, they explicitly recommend `reasoning.effort: low`, which makes sense: in voice, a few hundred extra milliseconds hurt more than slightly less deep reasoning.
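In session terms, that recommendation is a one-liner. I'm assuming here that the `reasoning.effort` knob sits directly on the session object passed to `session.update`; the exact field path may differ, so check the API reference.

```typescript
// Trade a bit of reasoning depth for latency: in voice, milliseconds
// matter more than marginally smarter answers. The field path is an
// assumption based on the documented `reasoning.effort: low` advice.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    reasoning: { effort: "low" },
  },
}));
```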
Among the useful features for system building, I noted MCP tools, image input, dedicated real-time modes for translation and streaming transcription, plus `session.update` for attaching tools mid-session. The pricing has also become more reasonable: $4 per million text input tokens and $16 per million output, about 20% cheaper than the preview.
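Attaching an MCP server is the part I'd prototype first, since it turns the voice loop into something that can actually act. A hedged sketch, again via `session.update`: the tool fields below (`server_label`, `server_url`, `require_approval`) follow OpenAI's published MCP examples as I recall them, and the URL is a made-up placeholder.

```typescript
// Point the session at a remote MCP server so the model can call its
// tools mid-conversation. Hypothetical endpoint; field names should be
// checked against the current Realtime API reference.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    tools: [
      {
        type: "mcp",
        server_label: "calendar",
        server_url: "https://example.com/mcp",
        require_approval: "never",
      },
    ],
  },
}));
```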
But let's not view this through rose-tinted glasses. The voices are still limited, and there are no custom voice profiles or SSML. So, for specific brands, accents, or localized delivery, I'd still consider an external TTS chain.
What This Changes for Business and Automation
The first clear winner here is voice support. Where previous AI implementations in telephony often failed on latency and poor interruption handling, you can now build an agent that, while not perfectly human, no longer infuriates users after the second sentence.
The second use case is real-time interfaces in applications: scheduling appointments, dispatching, and internal voice assistants for teams. The architecture is simplified because there are fewer separate nodes, less synchronization between STT, LLM, and TTS, and fewer points of failure that can crash overnight.
The losers in this story are those who built their product around the old cascaded architecture and saw it as the only option. It won't disappear, but now it will have to be justified by customization, not just by its mere existence.
Still, I wouldn't push this into production without proper testing for noise, interruptions, per-minute costs, and real-world telephony. At Nahornyi AI Lab, this is precisely what we build for clients: we don't just bolt on an API, we refine the AI integration until the system saves time instead of creating a new layer of chaos. If your voice processes are already slowing down your team, let's see how we can build a working AI solution here without any unnecessary magic.