Technical Context
I love these kinds of problems, where standard diarization breaks down within the first minute. When speakers stop taking turns and the audio turns into a bar-like cacophony, it's no longer plain speaker diarization but the classic cocktail party problem. For a production-grade AI integration, just plugging in Whisper and hoping for a miracle isn't enough.
I'd immediately divide the tools into two classes. The first class genuinely tries to understand who is speaking and when, even if voices overlap. The second class first separates the audio by source, and only then do you feed the result into an ASR or your AI automation chain.
From the first class, I'd look at EEND, which stands for End-to-End Neural Diarization. This isn't the old pipeline of VAD, embeddings, clustering, and prayer. The model is trained directly to handle unknown speakers, overlaps, and online processing, and ESPnet already has working recipes and streaming scenarios for this.
This is where I'd settle and not waste a week on exotic alternatives. If you need realtime and don't have pre-registered speaker embeddings, EEND and ESPnet currently look like the most sensible direction.
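What makes EEND overlap-aware is its output format: per-frame activity probabilities for every speaker at once, so two speakers can be active in the same frame. A minimal sketch of turning that matrix into timestamped, possibly overlapping segments; the frame hop, threshold, and the `probs` demo matrix are illustrative assumptions, not ESPnet's actual API.

```python
# Turn EEND-style frame-level speaker activity probabilities into
# (speaker, start, end) segments. A real EEND model (e.g. an ESPnet
# recipe) would produce `probs`; hop and threshold here are assumptions.

FRAME_HOP_S = 0.1  # assumed hop between frames, in seconds
THRESHOLD = 0.5    # per-speaker activity threshold

def probs_to_segments(probs, hop=FRAME_HOP_S, thr=THRESHOLD):
    """probs: list of frames, each a list of per-speaker activity probs."""
    n_spk = len(probs[0])
    segments = []
    for spk in range(n_spk):
        start = None
        for i, frame in enumerate(probs):
            active = frame[spk] >= thr
            if active and start is None:
                start = i * hop          # speaker turns on
            elif not active and start is not None:
                segments.append((spk, start, i * hop))  # speaker turns off
                start = None
        if start is not None:            # still active at the end
            segments.append((spk, start, len(probs) * hop))
    return sorted(segments, key=lambda s: s[1])

# Two speakers whose activity overlaps in the middle frames:
demo = [[0.9, 0.1], [0.9, 0.2], [0.8, 0.8], [0.2, 0.9], [0.1, 0.9]]
for spk, start, end in probs_to_segments(demo):
    print(f"SPK{spk}: {start:.1f}s - {end:.1f}s")
```

The key property: because each speaker gets its own activity track, the resulting segments are allowed to overlap in time, which clustering-based pipelines structurally cannot express.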
Meta's SAM Audio is interesting. I've dug into its logic, and it's excellent specifically as a source separation layer. It can extract sounds from a messy mix using prompts, but it's not native diarization or a system that will neatly return timestamps for unknown people in a live conversation.
SpeechBrain's sepformer-wham is also useful, but honestly, it's more about separation than a complete solution. I would use it as a pre-processing step before ASR or diarization if the voice overlap is particularly severe.
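The separation-first shape of that pipeline can be sketched as follows. Here `separate` and `transcribe` are deliberate stand-ins for the real models (a SepFormer separator and whatever ASR you run behind it); only the orchestration logic, one ASR pass per separated stream, then a time-ordered merge, is the point.

```python
# Sketch of "separate first, then recognize": each separated source gets
# its own ASR pass, and the per-source results are merged into one
# speaker-attributed transcript. Both stage functions are placeholders
# for real models (e.g. SpeechBrain's sepformer-wham + any ASR).

def separate(mixture):
    # Placeholder: a real separator returns one waveform per source.
    # Here `mixture` is already a list of per-source streams.
    return mixture

def transcribe(stream):
    # Placeholder for an ASR call; returns a list of (start, end, text).
    return stream

def separated_transcript(mixture):
    lines = []
    for src_id, stream in enumerate(separate(mixture)):
        for start, end, text in transcribe(stream):
            lines.append((start, f"SPK{src_id}", text))
    # Interleave by start time so overlapped speech reads in order.
    return [f"[{t:.1f}s] {spk}: {txt}" for t, spk, txt in sorted(lines)]

demo = [
    [(0.0, 1.2, "can you check the order"), (2.5, 3.0, "thanks")],
    [(0.8, 1.9, "one moment please")],
]
print("\n".join(separated_transcript(demo)))
```

Note the trade-off this structure buys you: ASR never sees overlapped audio, but you inherit the separator's artifacts and lose any cross-stream timing the separator gets wrong.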
The idea of using an LLM to label a finished transcript by meaning sounds tempting, and I've tested such setups myself. However, this is post-processing, not realtime, and with noisy overlaps, it's more likely to correct the dialogue structure than to save a broken audio stream.
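For completeness, the post-processing idea looks roughly like this: hand the LLM the finished diarized transcript and ask it to reassign lines whose content contradicts their label. The prompt wording and the demo lines below are my own illustrative assumptions; only the prompt assembly is shown, not any particular LLM API.

```python
# Sketch of LLM-based relabeling as post-processing: build a prompt from
# a diarized transcript and ask the model to fix attribution by meaning.
# The template text is an assumption, not a recommended production prompt.

def build_relabel_prompt(lines):
    """lines: list of (speaker_label, text) pairs from diarization + ASR."""
    transcript = "\n".join(f"{spk}: {txt}" for spk, txt in lines)
    return (
        "Below is a diarized transcript that may contain speaker "
        "attribution errors around overlapping speech.\n"
        "Reassign lines to speakers only where the content makes the "
        "current label implausible; do not rewrite the words.\n\n"
        + transcript
    )

prompt = build_relabel_prompt([
    ("SPK0", "How much is shipping to Berlin?"),
    ("SPK0", "It's 12 euros, arrives in two days."),  # semantically a reply,
                                                      # so likely mislabeled
])
```

This is exactly why it can fix dialogue structure but not a broken stream: the LLM only ever sees the text, so words ASR already merged or dropped during an overlap are unrecoverable at this stage.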
What This Means for Business and Automation
In practice, the winners will be businesses dealing with calls, meetings, dispatch lines, interviews, and support channels with multiple simultaneous speakers. There, accuracy isn't just a nice metric: it determines whether your conversation analytics, CRM logic, and downstream AI automation hold up or break.
The losers will be teams building a product solely on ASR without separation or overlap-aware diarization. An error in who said what propagates downstream, corrupting summaries, call search functionality, and any AI agent that needs to act on that context.
I would build the stack like this: overlap-aware diarization via EEND or ESPnet, separation via SAM Audio or SepFormer if needed, and only then ASR plus an LLM layer to fix the structure. At Nahornyi AI Lab, we specialize in manually dissecting these bottlenecks: if your audio pipeline loses meaning at intersections, we can build an AI solution development plan for your specific workflow, not for a generic demo scenario.
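The stack described above can be sketched as a gated pipeline: overlap-aware diarization always runs, the separation stage is inserted only when measured overlap is severe enough to justify its cost, and ASR plus the LLM fixer close the chain. The gate threshold and stage names are illustrative assumptions.

```python
# The proposed stack as a gated pipeline sketch: diarize always, add a
# separation stage only when overlap is severe ("if needed"), then ASR
# and an LLM structure-fixing pass. Stage names are placeholders for the
# real components (EEND/ESPnet, SAM Audio or SepFormer, ASR, LLM).

OVERLAP_GATE = 0.2  # assumed: min fraction of overlapped segments

def overlap_ratio(segments):
    """segments: (speaker, start, end) tuples from diarization."""
    if not segments:
        return 0.0
    overlapped = 0
    for i, (_, s1, e1) in enumerate(segments):
        # A segment counts as overlapped if any other segment intersects it.
        if any(s1 < e2 and s2 < e1
               for j, (_, s2, e2) in enumerate(segments) if i != j):
            overlapped += 1
    return overlapped / len(segments)

def plan_stages(segments):
    stages = ["eend_diarization"]
    if overlap_ratio(segments) >= OVERLAP_GATE:
        stages.append("source_separation")  # SAM Audio / SepFormer slot
    stages += ["asr", "llm_structure_fix"]
    return stages

heavy_overlap = [("A", 0.0, 2.0), ("B", 1.0, 3.0), ("A", 4.0, 5.0)]
print(plan_stages(heavy_overlap))
```

Gating separation on measured overlap keeps the expensive stage out of the path for clean turn-taking audio, where it would only add latency and artifacts.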