Engineering · AI · 2026-04-07 · 8 min read

How I Solved the Hardest Problem in Multi-Agent AI Voice Chat: Making Personas Actually Hear Each Other

The real challenge of building multi-persona AI voice group chat isn't getting one AI to talk — it's building the plumbing so they genuinely hear each other and respond naturally.

Most multi-agent AI demos I've seen pass text around. Agent A generates a response, you append it to a shared context buffer, Agent B reads it and responds. Clean, simple, works fine for async text chat.

Voice is a different beast entirely.

I spent the past several months building Personaplex — a group voice chat room where you can have a real-time spoken conversation with three distinct AI personas simultaneously. Not one AI pretending to be multiple characters. Three independent AI sessions, each with its own voice, personality, and speaking style, all talking to each other and to you in real time.

The architecture works. But getting there required solving a problem I hadn't seen documented anywhere: how do you give AI voice agents audio-level awareness of each other?

The Problem: Text Context Is Not Audio Context

Most cloud voice AI services treat each session as an isolated conversation. If I run three sessions for three personas, and Persona A says something, Personas B and C have no idea it happened. They're conversationally deaf to each other.

The naive fix is to transcribe everything and inject text into each session's context. I tried this. It works badly. The LLM responses sound like they're reading from a log rather than listening to a conversation, and there's no natural turn-taking — all three personas try to respond to everything simultaneously.

What you actually need is to give each AI the same sensory input a human participant would have: audio.

The Solution: Audio Cross-Injection

Here's the core insight: the realtime dialogue API isn't just for microphone input — it's a generic PCM audio input stream. If I feed Persona B's session the raw PCM output that Persona A just spoke, B's ASR will transcribe it exactly as if a human had said those words into the microphone. The LLM then responds as if it genuinely heard A speaking. Because from its perspective, it did.

Audio injection flow
Persona A finishes speaking (TTS_FINISHED event)
  → collect A's buffered TTS PCM chunks (24kHz, s16le)
  → downsample 24kHz → 16kHz (linear interpolation)
  → wrap each chunk in an audio frame
  → send to Persona B's open WebSocket session
  → send to Persona C's open WebSocket session
  → B and C's ASR hears A's voice, transcribes it
  → LLM responds to what A said
  → conversation feels alive
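The fan-out step above can be sketched in a few lines. This is a minimal illustration, not the actual Personaplex code: the `Session` shape, `onTtsFinished`, `sendAudioFrame`, and the injected `toAsrRate` resampler are all hypothetical names standing in for whatever the real server uses.

```typescript
// Hypothetical sketch of the fan-out on TTS_FINISHED: forward the speaker's
// buffered PCM to every *other* persona's open session. All names here are
// illustrative, not from the Personaplex codebase.
interface Session {
  personaId: string;
  ttsBuffer: Int16Array[];                  // buffered 24 kHz s16le TTS chunks
  sendAudioFrame(pcm: Int16Array): void;    // writes one audio frame to the WebSocket
}

function onTtsFinished(
  speaker: Session,
  sessions: Session[],
  toAsrRate: (pcm24k: Int16Array) => Int16Array, // 24 kHz → 16 kHz resampler
): void {
  for (const chunk of speaker.ttsBuffer) {
    const pcm16k = toAsrRate(chunk);
    for (const peer of sessions) {
      // The speaker never hears its own voice back.
      if (peer.personaId !== speaker.personaId) peer.sendAudioFrame(pcm16k);
    }
  }
  speaker.ttsBuffer = []; // clear so stale audio can't leak into a later turn
}
```

Passing the resampler in as a parameter keeps the fan-out logic independent of the sample-rate conversion, which is described next.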

The downsampling step is critical. The TTS output is 24kHz; the ASR input expects 16kHz. The conversion is linear interpolation: for each output sample at position i, compute the fractional source index i × 1.5, then interpolate between the two surrounding input samples.
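As a concrete sketch of that conversion (the function name is mine, not from the codebase), a 24 kHz → 16 kHz linear-interpolation resampler over s16le PCM looks roughly like this:

```typescript
// Illustrative 24 kHz → 16 kHz downsampler for s16le PCM, by linear
// interpolation as described above. Name and exact rounding are assumptions.
function downsample24kTo16k(input: Int16Array): Int16Array {
  const ratio = 24000 / 16000; // 1.5 source samples per output sample
  const outLen = Math.floor(input.length / ratio);
  const out = new Int16Array(outLen);
  for (let i = 0; i < outLen; i++) {
    const src = i * ratio;                        // fractional source index
    const lo = Math.floor(src);
    const hi = Math.min(lo + 1, input.length - 1);
    const frac = src - lo;
    // Weighted average of the two surrounding input samples.
    out[i] = Math.round(input[lo] * (1 - frac) + input[hi] * frac);
  }
  return out;
}
```

Linear interpolation is not the highest-fidelity resampler (it has no anti-aliasing filter), but for speech feeding an ASR it is cheap and good enough.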

Floor Control

With three AI personas, you immediately hit the next problem: they all try to respond at once. I solved this with a Valkey (Redis-compatible) atomic lock:

SET room:{roomId}:floor {personaId} NX PX 90000

Only the first persona to trigger a TTS_STARTED event gets the floor. The others see the lock exists and stay quiet. The lock releases automatically after 90 seconds (preventing deadlock) or immediately when TTS_FINISHED fires.

For safe release, a Lua script atomically checks the current floor holder and deletes the key only if it matches:

if redis.call('get', KEYS[1]) == ARGV[1] then
  return redis.call('del', KEYS[1])
else
  return 0
end
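To make the two halves of the floor lock concrete, here's a small self-contained model of their semantics. This is an in-memory stand-in for illustration only; in production the acquire is the real SET … NX PX command against Valkey and the release is the Lua script above, and the class and method names here are mine.

```typescript
// In-memory model of the Valkey floor lock, for illustration only.
// acquire() mirrors SET room:{roomId}:floor {personaId} NX PX 90000;
// release() mirrors the compare-and-delete Lua script shown above.
class FloorLock {
  private locks = new Map<string, { personaId: string; expiresAt: number }>();

  // NX PX semantics: succeed only if no unexpired holder exists.
  acquire(roomId: string, personaId: string, ttlMs = 90_000, now = Date.now()): boolean {
    const key = `room:${roomId}:floor`;
    const cur = this.locks.get(key);
    if (cur && cur.expiresAt > now) return false; // someone else has the floor
    this.locks.set(key, { personaId, expiresAt: now + ttlMs });
    return true;
  }

  // Lua-script semantics: delete only if we are still the current holder.
  release(roomId: string, personaId: string, now = Date.now()): boolean {
    const key = `room:${roomId}:floor`;
    const cur = this.locks.get(key);
    if (!cur || cur.expiresAt <= now || cur.personaId !== personaId) return false;
    this.locks.delete(key);
    return true;
  }
}
```

The compare-before-delete in `release` is the important part: without it, a persona whose lock already expired could delete the floor out from under whoever acquired it next.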

Interruption

Users need to be able to break in mid-sentence. The browser monitors microphone RMS level continuously. When it crosses a threshold while an AI is speaking, it sends a JSON message:

{ type: "interrupt" }

The server runs the same Lua floor-release script, then clears any buffered TTS audio to prevent injecting the interrupted persona's unfinished speech into the others. The conversation resets cleanly. You have the floor.
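The browser-side check reduces to a pure RMS computation plus a threshold gate. A minimal sketch, assuming samples come from something like the Web Audio API's `AnalyserNode.getFloatTimeDomainData` (the threshold value, frame size, and function names are illustrative):

```typescript
// Hypothetical sketch of the client-side barge-in check. Compute the RMS of a
// frame of mic samples; if it crosses a threshold while an AI persona is
// speaking, send the interrupt message over the WebSocket.
function rms(samples: Float32Array): number {
  let sum = 0;
  for (const s of samples) sum += s * s;
  return Math.sqrt(sum / samples.length);
}

function maybeInterrupt(
  samples: Float32Array,
  aiSpeaking: boolean,
  ws: { send(data: string): void },
  threshold = 0.05, // assumed value; tune against your mic gain and noise floor
): boolean {
  if (!aiSpeaking || rms(samples) < threshold) return false;
  ws.send(JSON.stringify({ type: "interrupt" }));
  return true;
}
```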

What Surprised Me

  • The audio injection approach produces qualitatively different conversations than text context sharing. The difference is not subtle.
  • Language learners are the most engaged user segment — the teacher AI corrects you while the native speaker responds to your corrected version, simultaneously.
  • Users find use cases I never anticipated: therapy practice, interview prep, TRPG sessions, decision-making with a devil's advocate.
  • The interrupt feature (you speak = AI stops) feels surprisingly natural. More polite than I expected.

Tech Stack

  • Custom Node.js HTTP+WebSocket server wrapping Next.js 15
  • Volcengine Doubao realtime dialogue API (ASR + LLM + TTS in one pipeline, ~400ms latency)
  • Valkey (Redis-compatible) for atomic floor locking
  • PostgreSQL 17 for transcripts and user data
  • React 19 + PCM queue player (24kHz, no buffering delay)

Try it yourself

The result is live at personaplex.aifly.club. Free tier: 30 minutes/day, no credit card required. The default room has three personas — a teacher, a comedian, and an advisor. Ask them something they'd disagree about.
