Core Concepts

Voice AI Fundamentals

Master the essential concepts, terminology, and technical foundations that power every voice AI agent.

The Three Pillars of Voice AI

Every voice AI agent runs a conversation loop of three components. Each pillar affects how natural and effective the conversation feels; if one lags, the entire user experience breaks.

1. Listening (STT)

Speech-to-Text converts the caller's speech into text.

What It Does: Captures audio and transcribes it in real-time

Why It Matters: Determines how fast and accurately the AI "understands." A delay here makes the agent feel slow or "deaf."

BlueMachines Uses: Deepgram nova-3, Sarvam, Cartesia

2. Thinking (LLM)

Language Model interprets what was said and decides what to say next.

What It Does: Processes text, reasons about context, generates responses

Why It Matters: This is the "brain." It controls tone, empathy, reasoning, and intent recognition.

BlueMachines Uses: GPT-4.1, GPT-5, Claude 4.6, Gemini, Grok

3. Speaking (TTS)

Text-to-Speech turns the response into human-like voice.

What It Does: Converts text to natural-sounding speech with emotion

Why It Matters: Impacts warmth, trust, and how "alive" the agent feels. Poorly tuned TTS = robotic experience.

BlueMachines Uses: ElevenLabs Flash v2.5, Cartesia Sonic

How the Loop Works in Real Time

Understanding the millisecond-by-millisecond flow

1. User speaks

The STT engine (e.g., Deepgram nova-3) converts speech → text. Speed and clarity here directly affect latency — the faster the STT, the sooner the LLM can start thinking.

2. LLM receives live text chunks

It processes partial transcripts and starts reasoning before the user even finishes. This is where preemptive generation comes in — the LLM begins crafting replies mid-sentence for near-zero lag.

3. LLM produces streaming output

The model sends text segments as it generates them (not after finishing the full thought).

4. TTS converts the text stream into speech

Sentence by sentence, often word by word. Natural pauses, tone, and emotion here create realism — what we call acoustic believability.

5. The conversation flows

Voice Activity Detection (VAD) monitors when the user starts speaking again, and Turn Detection ensures smooth handoffs between human and AI. The sketch below traces this loop in code.
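To make the flow concrete, here is a minimal, runnable sketch of the loop in Python. The fake_stt, fake_llm_stream, and fake_tts functions are hypothetical stand-ins for real provider SDKs; only the streaming structure is the point.

```python
import asyncio

# Hypothetical stand-ins for real STT/LLM/TTS provider SDKs.
async def fake_stt():
    yield "what are your opening hours"               # final transcript of one utterance

async def fake_llm_stream(text):
    for segment in ["We're open ", "9am to 6pm, ", "Monday through Saturday."]:
        await asyncio.sleep(0.05)                     # model streams segments as generated
        yield segment

async def fake_tts(segment):
    await asyncio.sleep(0.05)                         # synthesis time-to-first-audio
    return f"<audio for {segment!r}>"

async def conversation_loop():
    async for transcript in fake_stt():               # 1. user speech -> text
        async for segment in fake_llm_stream(transcript):  # 2-3. streaming LLM output
            audio = await fake_tts(segment)           # 4. segment-by-segment TTS
            print("play:", audio)                     # 5. audio back to the caller

asyncio.run(conversation_loop())
```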

Essential Technical Concepts

Terms you'll use every day in voice AI

Latency

The time delay between user speech and AI response. Lower latency = more natural conversation.

< 500ms: Natural, responsive (BlueMachines standard)

> 1s: Breaks conversational illusion

Voice Activity Detection (VAD)

Detects when a user starts or stops speaking. Critical for natural turn-taking.

Min Silence Duration: How long silence must last before the user is considered done speaking

Activation Threshold: Sensitivity to voice input

• BlueMachines uses Silero VAD
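To see how these two settings interact, here is a deliberately simplified, energy-based sketch. Silero VAD itself is a neural model, so treat this only as an illustration of the mechanics; the frame length and threshold values below are invented for the example.

```python
import numpy as np

# Simplified energy-based VAD; the real Silero VAD is a neural model.
FRAME_MS = 30                 # analysis frame length
ACTIVATION_THRESHOLD = 0.02   # RMS energy above this counts as speech (invented value)
MIN_SILENCE_MS = 400          # silence this long ends the user's turn

def detect_end_of_turn(frames):
    """frames: iterable of float32 arrays, one per FRAME_MS of audio."""
    silence_ms, speaking = 0, False
    for i, frame in enumerate(frames):
        is_speech = np.sqrt(np.mean(frame ** 2)) > ACTIVATION_THRESHOLD
        if is_speech:
            speaking, silence_ms = True, 0
        elif speaking:
            silence_ms += FRAME_MS
            if silence_ms >= MIN_SILENCE_MS:
                return i      # frame index where the turn ended
    return None

# 20 frames of "speech" followed by 20 frames of near-silence (16 kHz, 30 ms frames).
rng = np.random.default_rng(0)
frames = [rng.normal(0, 0.1, 480).astype(np.float32) for _ in range(20)]
frames += [rng.normal(0, 0.001, 480).astype(np.float32) for _ in range(20)]
print("turn ended at frame", detect_end_of_turn(frames))
```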

Turn Detection

Decides when the AI should take the floor. Manages interruptions and smooth handoffs.

Interrupt Duration: Detects when user cuts in

Backoff Delay: Wait before resuming after user finishes

Endpointing: Confirms user has truly finished
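The interrupt side can be sketched as task cancellation: if sustained user speech is confirmed while the agent is mid-reply, playback stops and the floor is yielded. All names and timings below are hypothetical stand-ins, not a real SDK.

```python
import asyncio

INTERRUPT_DURATION_MS = 250   # user must speak this long to count as a barge-in

async def play_reply():
    for sentence in ["One moment,", "let me check that for you."]:
        print("agent:", sentence)
        await asyncio.sleep(0.6)          # stands in for TTS playback

async def detect_barge_in():
    await asyncio.sleep(0.7)              # VAD hears the user mid-reply...
    await asyncio.sleep(INTERRUPT_DURATION_MS / 1000)  # ...and speech is sustained
    return True

async def main():
    playback = asyncio.create_task(play_reply())
    if await detect_barge_in():           # turn detector confirms the interrupt
        playback.cancel()                 # agent stops talking immediately
        try:
            await playback
        except asyncio.CancelledError:
            pass
        print("agent: (yields the floor)")

asyncio.run(main())
```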

End of Utterance

Wait time the agent adds to infer whether the user has completed their turn.

• Prevents premature interruption of user speech

• Balances responsiveness with natural conversation flow

• Typical range: 100-200ms depending on use case

Preemptive Generation

LLM starts forming a response while the user is still speaking. Minimizes latency.

• Enabled in realtime models (GPT-4o-realtime, Gemini Live)

• Crucial for "instant" responses

• Creates human-like conversational rhythm
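A rough sketch of the idea, with a hypothetical stand-in for the LLM call: start drafting a reply from the partial transcript, then keep or discard the draft once the final transcript arrives.

```python
import asyncio

async def draft_reply(prompt):
    await asyncio.sleep(0.3)              # stands in for LLM generation
    return f"draft reply for {prompt!r}"

async def main():
    partial = "do you have any appointments"
    draft = asyncio.create_task(draft_reply(partial))   # start before the user finishes
    final = "do you have any appointments tomorrow"     # final transcript arrives later
    if final.startswith(partial):                       # user continued the same thought
        print(await draft, "(kept; refined with the final words)")
    else:
        draft.cancel()                                  # user said something else
        print("draft discarded; regenerating from the final transcript")

asyncio.run(main())
```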

Stability (TTS)

Controls consistency of tone in voice output: a trade-off between human-like expressiveness and robotic flatness.

Low (0.2-0.4): Lively but inconsistent

Balanced (0.6-0.8): Natural and reliable ✓

High (>0.9): Robotic, flat delivery
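As a concrete example, ElevenLabs exposes stability through a voice_settings object on its text-to-speech endpoint. The request below follows the publicly documented REST shape, but verify the field names and model id against current ElevenLabs docs; the voice id and API key are placeholders.

```python
import requests

VOICE_ID = "your-voice-id"                # placeholder
resp = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": "YOUR_API_KEY"},
    json={
        "text": "Thanks for calling! How can I help you today?",
        "model_id": "eleven_flash_v2_5",  # Flash v2.5, as used above
        "voice_settings": {
            "stability": 0.7,             # balanced: natural and reliable
            "similarity_boost": 0.75,     # how closely to match the base voice
        },
    },
)
with open("reply.mp3", "wb") as f:
    f.write(resp.content)                 # response body is the audio
```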

SSML (Speech Synthesis Markup Language)

Adds emotion, pauses, and emphasis to spoken output. Makes agents sound alive.

• Control pauses: <break time="300ms"/>

• Add emphasis: <emphasis>

• Enables expressive "speech syntax"
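A small illustrative snippet combining both tags; exact tag support varies by TTS provider, so check yours.

```xml
<speak>
  Thanks for calling BlueMachines.
  <break time="300ms"/>
  I can <emphasis level="strong">absolutely</emphasis> help with that.
</speak>
```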

Telephony & Transport Layer

How voice data travels between users and AI agents

SIP (Session Initiation Protocol)

The standard protocol for establishing, managing, and terminating phone calls over IP networks.

SIP Trunking: Connects voice AI to PSTN (traditional phone network) via providers like Twilio, Plivo, Exotel

Call Flow: INVITE → 100 Trying → 180 Ringing → 200 OK → ACK → Media (RTP) → BYE

Use Case: Inbound/outbound phone calls to real phone numbers

WebRTC

Browser-based real-time communication for audio, video, and data — no plugins needed.

Peer-to-Peer: Direct browser-to-server audio streaming with low latency

STUN/TURN: NAT traversal servers that help establish connections through firewalls

Use Case: Web-based voice agents, in-app calling, browser demos

WebSocket

Bidirectional, persistent connection for real-time streaming of audio and text data.

Full-Duplex: Send and receive audio/text simultaneously over a single connection

Server Pipeline: Streams audio between telephony gateway ↔ STT ↔ LLM ↔ TTS

Use Case: Server-side audio transport, real-time transcription streaming
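A minimal full-duplex sketch using the Python websockets library. The URL, message framing, and playback hook are hypothetical; real gateways define their own wire protocol.

```python
import asyncio
import websockets  # pip install websockets

def play(frame: bytes) -> None:
    print(f"received {len(frame)} bytes of synthesized audio")  # playback stand-in

async def stream_call(audio_chunks):
    # Hypothetical pipeline endpoint; a real gateway defines its own protocol.
    async with websockets.connect("wss://example.com/voice-pipeline") as ws:
        async def send_audio():
            for chunk in audio_chunks:    # raw caller audio from the telephony side
                await ws.send(chunk)
        async def recv_audio():
            async for message in ws:      # TTS audio streaming back concurrently
                play(message)
        await asyncio.gather(send_audio(), recv_audio())  # full duplex: both at once

# asyncio.run(stream_call(chunks))  # run with an iterable of audio byte frames
```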

How They Connect

Different transport layers serve different entry points into the voice AI system.

Phone Calls: SIP/PSTN → Telephony Gateway → WebSocket → AI Pipeline

Browser/App: WebRTC → Media Server → WebSocket → AI Pipeline

API Integrations: WebSocket directly to AI Pipeline for server-to-server

Audio Quality & Preprocessing

Clean audio in = accurate transcription out

Noise Cancellation

Suppresses background noise so STT can focus on the speaker's voice.

Background Noise Suppression: Filters out traffic, office chatter, wind, machinery

Echo Cancellation (AEC): Removes audio feedback when speaker hears their own voice

Automatic Gain Control: Normalizes volume levels across quiet and loud speakers
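Of the three, automatic gain control is the easiest to illustrate. The sketch below scales each chunk toward a target RMS level; real AGC and noise suppressors (Krisp, provider built-ins) are far more sophisticated, and the constants here are invented.

```python
import numpy as np

TARGET_RMS = 0.1   # desired loudness (invented target)

def apply_agc(chunk: np.ndarray) -> np.ndarray:
    rms = np.sqrt(np.mean(chunk ** 2)) + 1e-8     # avoid divide-by-zero on silence
    gain = np.clip(TARGET_RMS / rms, 0.1, 10.0)   # bound the correction
    return np.clip(chunk * gain, -1.0, 1.0)       # keep samples in range

# A quiet speaker at roughly a tenth of the target level gets boosted.
quiet = np.random.default_rng(1).normal(0, 0.01, 16000).astype(np.float32)
boosted = apply_agc(quiet)
print(f"rms before={np.sqrt(np.mean(quiet ** 2)):.3f}, "
      f"after={np.sqrt(np.mean(boosted ** 2)):.3f}")
```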

Why It Matters

Real-world calls happen in noisy environments — offices, streets, cars, crowded spaces.

Without preprocessing: STT accuracy drops 20-40% in noisy environments

With preprocessing: Consistent transcription quality regardless of environment

Provider support: Deepgram and Sarvam offer built-in noise suppression; Krisp can be added as a preprocessing layer

Voice AI Architecture

End-to-end flow from caller to AI and back

Phone Call Path (SIP/PSTN)

Phone Caller → SIP/PSTN → Telephony Gateway (Twilio/Plivo) → WebSocket Audio Stream → STT (Deepgram) → LLM (GPT-4.1) → TTS (Cartesia) → WebSocket Audio Stream → Telephony Gateway → Phone Caller

Browser Path (WebRTC)

Browser → WebRTC → Media Server → STT → LLM → TTS → Media Server → Browser Speaker

Key Insight: Regardless of how the call arrives (SIP phone or WebRTC browser), it passes through the same STT → LLM → TTS pipeline. The transport layer handles getting audio in and out; the AI pipeline handles understanding and responding.

LLM Configuration Parameters

How to tune the "brain" of your voice agent

Temperature

What It Does: Controls creativity. Low = predictable, high = creative.

Why It Matters: For voice agents, keep it low (~0.1) for consistent and confident speech. High temperature can make the AI sound indecisive.

Max Tokens

What It Does: Limits how long responses can be.

Why It Matters: Prevents overly verbose or delayed replies — ideal for call-center style agents that must sound concise.

Preemptive Generation

What It Does: LLM starts forming a response while the user is still speaking.

Why It Matters: Minimizes latency and overlap gaps — crucial for "instant" responses.

Realtime Mode

What It Does: Chooses between text or speech input/output.

Why It Matters: Speech mode synchronizes with the audio pipeline for smoother live streaming.
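As one concrete (hedged) example, here is how those knobs map onto the OpenAI Python SDK's Chat Completions call; the model id is a placeholder for whichever model your agent runs.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4.1",       # placeholder: swap in your provider's model id
    messages=[
        {"role": "system", "content": "You are a concise phone agent. "
                                      "Answer in one or two short sentences."},
        {"role": "user", "content": "What are your opening hours?"},
    ],
    temperature=0.1,       # low: consistent, confident speech
    max_tokens=120,        # cap length for concise, call-center style replies
    stream=True,           # stream segments to TTS as they are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)   # each segment would be forwarded to TTS
```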

💡 Key Insight: LLM tuning is about flow control, not just intelligence. The right configuration ensures your AI doesn't talk over people, pause awkwardly, or "think aloud."

Why "Realtime" Models Matter

Traditional LLMs (like GPT-4, Claude Sonnet, Gemini Pro) wait until you stop speaking before replying. That's fine for typing — but it kills the illusion of a live conversation.

Realtime models can:

  • Listen while generating responses
  • Stream partial outputs instantly
  • Support dynamic interruption and mid-speech reasoning

Impact: Lower latency → more natural overlaps → human-like conversational rhythm.

People subconsciously link response speed to emotional intelligence: real-time replies read as empathy.

The Performance Equation

Understanding total response time

Total latency ≈ STT + Turn Detection + End of Utterance + LLM + TTS

Target: < 700 ms for natural feel

Input Latency

• STT: ~200 ms (streaming transcription)

• Turn Detection: ~150 ms (silence confirmation)

• End of Utterance: ~100-200 ms (wait time to infer user turn completion)

= ~450-550 ms input latency

Output Latency

• LLM Generation: ~200 ms (first token)

• TTS Synthesis: varies by provider (streaming audio)

= ~500 ms output latency (implying roughly 300 ms of TTS time-to-first-audio)
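A quick back-of-envelope check of these numbers (the TTS figure is an assumed ~300 ms time-to-first-audio). Summed strictly in sequence the stages land near one second, which is exactly why real pipelines stream and overlap stages instead of running them serially:

```python
budget_ms = {
    "stt": 200,               # streaming transcription
    "turn_detection": 150,    # silence confirmation
    "end_of_utterance": 150,  # midpoint of the 100-200 ms range
    "llm_first_token": 200,
    "tts_first_audio": 300,   # assumed; varies by provider
}
sequential = sum(budget_ms.values())
print(f"strictly sequential: {sequential} ms")        # ~1000 ms: too slow if serial

# With streaming, STT and turn detection overlap the tail of the user's speech,
# and the LLM streams straight into TTS, so perceived latency approaches:
perceived = (budget_ms["end_of_utterance"]
             + budget_ms["llm_first_token"]
             + budget_ms["tts_first_audio"])
print(f"perceived with overlap: ~{perceived} ms")     # ~650 ms: inside the target
```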

Human Perception Thresholds

< 200 ms: Feels simultaneous (human overlap)
200-500 ms: Feels responsive ✓ BlueMachines standard
500-1000 ms: Feels like "robotic thinking"
> 1 s: Breaks conversational illusion

🧩 Key Insight: This is why every millisecond counts — the user's brain notices lag before their ears do.

The Art and Science of Voice AI

Building a great voice agent is not about a single model — it's about timing, tone, and trust.

Timing

Comes from tuned STT/VAD latency

Tone

Comes from balanced TTS parameters and low-temperature LLMs

Trust

Comes from responsive, emotionally consistent dialogue

When done right, the agent feels instantly responsive, emotionally fluent, and capable.