Building Sub-200ms Voice AI with OpenAI Realtime API

ai · voice · openai · twilio · realtime
By Ryan Cwynar · 3 min read

Voice AI has always had a latency problem. Traditional pipelines—speech-to-text, LLM processing, text-to-speech—stack delays that make conversations feel robotic. Users wait 2-3 seconds for responses. It kills the magic.

OpenAI's Realtime API changes everything. We're talking sub-200ms response times. Bidirectional audio streaming. Real conversations with AI.

Here's how I built it.

The Architecture

Twilio (Phone) <-> WebSocket Server <-> OpenAI Realtime API
     ↓                    ↓                    ↓
  PSTN Audio      Media Stream Bridge      GPT-4o Realtime
   (μ-law)           (Base64)              (PCM 24kHz)

The key insight: no transcription step. Audio goes directly to the model, and audio comes directly back. The model "hears" and "speaks" natively.

Twilio Media Streams

When someone calls your Twilio number, you respond with TwiML that opens a WebSocket:

<Response>
  <Connect>
    <Stream url="wss://your-server.com/media-stream">
      <Parameter name="callerNumber" value="{From}"/>
    </Stream>
  </Connect>
</Response>

Twilio sends audio chunks as base64-encoded μ-law (8kHz). You'll need to transcode to PCM 24kHz for OpenAI.
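That transcode step can be sketched as follows. The μ-law decode is the standard G.711 algorithm; the 8kHz → 24kHz upsampling here is naive 3× sample repetition, a stand-in for a proper interpolating resampler:

```javascript
// Decode one 8-bit mu-law byte to a 16-bit linear PCM sample (G.711).
function mulawToPcm16(mulawByte) {
  const BIAS = 0x84;
  const u = ~mulawByte & 0xff;
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const sample = (((mantissa << 3) + BIAS) << exponent) - BIAS;
  return sign ? -sample : sample;
}

// Base64 mu-law 8kHz in -> base64 PCM16 (little-endian) 24kHz out.
// Naive upsampling: repeat each decoded sample 3x. Good enough to test
// the pipe; use a real resampler (low-pass filtered) in production.
function transcodeMulaw8kToPcm24k(base64Mulaw) {
  const mulaw = Buffer.from(base64Mulaw, 'base64');
  const pcm = Buffer.alloc(mulaw.length * 3 * 2); // 3x rate, 2 bytes/sample
  for (let i = 0; i < mulaw.length; i++) {
    const s = mulawToPcm16(mulaw[i]);
    for (let j = 0; j < 3; j++) {
      pcm.writeInt16LE(s, (i * 3 + j) * 2);
    }
  }
  return pcm.toString('base64');
}
```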

The WebSocket Bridge

Your server maintains two WebSocket connections:

  1. Twilio → Your Server: Receives caller audio
  2. Your Server → OpenAI: Sends/receives audio from the model

// Simplified flow
twilioWs.on('message', (data) => {
  const { event, media } = JSON.parse(data);
  if (event === 'media') {
    const pcmAudio = transcode(media.payload); // μ-law → PCM
    openaiWs.send(JSON.stringify({
      type: 'input_audio_buffer.append',
      audio: pcmAudio
    }));
  }
});

openaiWs.on('message', (data) => {
  const event = JSON.parse(data);
  if (event.type === 'response.audio.delta') {
    const mulawAudio = transcode(event.delta); // PCM → μ-law
    twilioWs.send(JSON.stringify({
      event: 'media',
      streamSid: streamSid,
      media: { payload: mulawAudio }
    }));
  }
});
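The `streamSid` echoed on outbound frames comes from Twilio's initial `start` event, which also carries the `<Parameter>` values from the TwiML. A sketch of handling the full event lifecycle (the comments mark where the bridge code above plugs in):

```javascript
// Twilio's first message on the stream is a 'start' event; 'media' frames
// follow, and 'stop' arrives when the caller hangs up.
let streamSid = null;

function handleTwilioEvent(raw) {
  const msg = JSON.parse(raw);
  switch (msg.event) {
    case 'start':
      streamSid = msg.start.streamSid; // needed on every outbound frame
      console.log('Call from', msg.start.customParameters?.callerNumber);
      break;
    case 'media':
      // forward msg.media.payload to OpenAI (see the bridge above)
      break;
    case 'stop':
      // caller hung up: close the OpenAI session, flush transcripts
      break;
  }
  return msg.event;
}
```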

OpenAI Realtime Session Setup

const openaiWs = new WebSocket(
  'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17',
  {
    headers: {
      'Authorization': `Bearer ${OPENAI_API_KEY}`,
      'OpenAI-Beta': 'realtime=v1'
    }
  }
);

// Configure the session
openaiWs.send(JSON.stringify({
  type: 'session.update',
  session: {
    turn_detection: { type: 'server_vad' },
    input_audio_format: 'pcm16',
    output_audio_format: 'pcm16',
    voice: 'ash',
    instructions: 'You are a helpful assistant...',
    input_audio_transcription: { model: 'whisper-1' }
  }
}));

Server VAD: The Secret Sauce

turn_detection: { type: 'server_vad' } enables Voice Activity Detection on OpenAI's side. The model automatically detects when the user stops speaking and begins responding. No manual endpointing needed.

This is crucial for natural conversation flow.
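`server_vad` also takes tuning knobs for sensitivity and turn-ending silence. The values below are illustrative defaults I'd start from, not recommendations; tune them against real call audio:

```javascript
// VAD tuning lives inside the same session.update payload.
const sessionUpdate = {
  type: 'session.update',
  session: {
    turn_detection: {
      type: 'server_vad',
      threshold: 0.6,          // higher = less sensitive to background noise
      prefix_padding_ms: 300,  // audio kept from just before speech was detected
      silence_duration_ms: 500 // how long a pause ends the user's turn
    }
  }
};
```

Phone lines are noisy, so a slightly higher `threshold` than the default tends to cut false turn-taking triggers.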

Transcription Bonus

Even though audio goes directly to the model, you can still get transcriptions:

// OpenAI sends these events
{ type: 'conversation.item.input_audio_transcription.completed', transcript: '...'}  
{ type: 'response.audio_transcript.delta', delta: '...'}  

I save both sides to /transcripts/ for logging and analysis.

Latency Breakdown

| Component | Time |
|-----------|------|
| Twilio → Server | ~50ms |
| Server → OpenAI | ~30ms |
| Model Processing | ~80ms |
| OpenAI → Server | ~30ms |
| Server → Twilio | ~50ms |
| Total | ~200ms |

Compare this to traditional pipelines (2-3 seconds) and it's night and day.

Production Considerations

  1. Audio Format Hell: Twilio uses μ-law 8kHz. OpenAI wants PCM 24kHz. Budget time for transcoding bugs.

  2. WebSocket Lifecycle: Handle disconnections gracefully. Twilio will retry, but OpenAI sessions need manual reconnection.

  3. Costs: Realtime API pricing is per-minute of audio. Monitor usage closely.

  4. Interruptions: The model can be interrupted mid-response. Listen for input_audio_buffer.speech_started, send response.cancel, and clear any audio still buffered on the Twilio side.
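Barge-in handling ties the last two points together. A sketch, assuming the `openaiWs`/`twilioWs`/`streamSid` wiring from the bridge above: when OpenAI's VAD hears the caller start speaking mid-response, cancel the in-flight response and tell Twilio to drop its queued audio.

```javascript
// React to the caller interrupting the assistant.
function handleOpenAIEvent(raw, openaiWs, twilioWs, streamSid) {
  const event = JSON.parse(raw);
  if (event.type === 'input_audio_buffer.speech_started') {
    // Stop the in-flight model response...
    openaiWs.send(JSON.stringify({ type: 'response.cancel' }));
    // ...and flush audio Twilio has already buffered for playback.
    twilioWs.send(JSON.stringify({ event: 'clear', streamSid }));
  }
  return event.type;
}
```

Without the `clear` message, the caller keeps hearing several seconds of already-queued audio after interrupting, which defeats the point.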

Results

The voice assistant feels genuinely conversational. Users can interrupt, ask follow-ups, and get responses fast enough that it feels like talking to a person.

This is the future of voice interfaces. The latency barrier is broken.


Building voice AI? I'd love to hear about your approach. Find me on LinkedIn.