Voice AI · February 10, 2026 · 5 min read

The AI voice agent build playbook: ElevenLabs + Twilio + Claude, step by step

Everything we've learned deploying AI voice agents for clinics, home services, and law firms — architecture, prompts, latency tricks, and the exact stack that makes it sound human.

Gavish Goyal
Founder, NoFluff Pro

The technology that makes a voice AI sound human did not exist 18 months ago. The gap between 'press 1 for sales' hell and natural voice conversation closed faster than anyone predicted. Here's the exact stack and build playbook we use.

When people think 'voice AI,' most still picture 2015-era interactive voice response (IVR) systems: menu trees, robotic voices, and customers mashing '0' to reach a human. That's not what modern voice AI is. Modern voice AI is a conversation that feels 90% human — because the components that made it robotic are finally fixed.

We've deployed voice agents for a clinic chain, a home services company, and a law firm in the last year. Here's what actually works, what doesn't, and how you'd build it yourself if you wanted to.

The stack

| Layer | What we use | Why |
| --- | --- | --- |
| Telephony | Twilio Voice | Mature, reliable, global number coverage, generous programmatic API |
| Speech-to-text | OpenAI Whisper / Deepgram Nova-2 | Sub-300ms latency, handles accents, streams partial results |
| LLM brain | Claude Sonnet 4.5 | Best tool use, best multi-turn conversation coherence in 2026 |
| Text-to-speech | ElevenLabs v3 Turbo | Human-indistinguishable voices, 250ms TTFT, emotion control |
| Orchestration | Custom Python on FastAPI + Redis | We need sub-500ms total latency; off-the-shelf orchestrators are too slow |
| Tools (functions) | Calendar API, CRM API, RAG store, SMS sender | The agent's ability to DO things, not just talk |

The latency obsession

Here's the single most important thing about building voice agents that don't feel robotic: latency is everything. If there's more than 500ms between when the caller stops speaking and when the agent starts responding, it feels wrong. Under 500ms, it feels natural. Under 300ms, it feels eerily human.

This is why most off-the-shelf voice AI platforms feel bad. They chain naive sequential steps (STT → LLM → TTS) and the total latency balloons to 2-3 seconds. Every conversation has awkward pauses. Callers hate it.
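The math behind that balloon is simple enough to sketch. The numbers below are illustrative assumptions, not measurements: sequential chaining pays the full duration of every stage before the caller hears anything, while pipelining only pays each stage's startup cost.

```python
# Illustrative latency budget. All numbers are assumptions for the sketch,
# not benchmarks of any specific vendor.

# Sequential: each stage waits for the previous one to finish completely.
SEQUENTIAL_MS = {
    "stt_wait_for_full_transcript": 600,
    "llm_full_generation_start_to_done": 800,
    "tts_full_synthesis": 800,
}
sequential_total = sum(SEQUENTIAL_MS.values())  # silence the caller sits through

# Pipelined: each stage starts on partial output of the previous one, so the
# perceived gap is roughly the sum of per-stage *startup* costs only.
PIPELINED_MS = {
    "vad_plus_partial_transcript": 80,
    "llm_time_to_first_token": 150,
    "tts_time_to_first_byte": 250,
}
pipelined_total = sum(PIPELINED_MS.values())

print(sequential_total)  # 2200 — the "2-3 seconds" zone
print(pipelined_total)   # 480 — under the 500ms bar
```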

< 500ms: the latency target that separates 'feels human' from 'feels robotic'

How we hit 500ms

  1. Streaming everywhere. STT streams partial transcripts to the LLM before the caller finishes speaking. LLM streams tokens to TTS before it finishes generating. TTS streams audio to Twilio as it's synthesized. All three layers are pipelined.
  2. Voice activity detection (VAD). Don't wait for silence — detect end-of-speech aggressively. Saves 200-400ms per turn.
  3. Optimistic tool calls. When the LLM decides it needs to check the calendar, start the API call immediately while the LLM is still generating the acknowledgment sentence.
  4. Pre-warmed TTS connection. ElevenLabs has a websocket mode — we keep it open so first-byte-to-speech is 250ms, not 800ms.
  5. Regional deployment. Host the orchestrator in the same AWS region as Twilio's SIP endpoints. Shaves 50-100ms of network round trips.
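The "streaming everywhere" idea (point 1) can be sketched with asyncio queues: three stages connected by queues, each consuming partial output from the one before it so all three run concurrently. The stage bodies below are stand-ins for the real Deepgram/Claude/ElevenLabs streaming clients; every name here is a placeholder, not our production orchestrator.

```python
import asyncio

async def stt_stage(audio_in: asyncio.Queue, text_out: asyncio.Queue):
    # Push partial transcripts downstream before the caller finishes speaking.
    while (chunk := await audio_in.get()) is not None:
        await text_out.put(f"heard {chunk}")
    await text_out.put(None)  # end-of-speech sentinel

async def llm_stage(text_in: asyncio.Queue, token_out: asyncio.Queue):
    # Start generating on partial transcripts; stream tokens downstream.
    while (partial := await text_in.get()) is not None:
        for token in partial.split():
            await token_out.put(token)
    await token_out.put(None)

async def tts_stage(token_in: asyncio.Queue, played: list):
    # Synthesize and "play" audio per token as it arrives.
    while (token := await token_in.get()) is not None:
        played.append(f"audio<{token}>")

async def run_pipeline(chunks: list[str]) -> list[str]:
    audio, text, tokens = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    played: list[str] = []
    # All three stages run at once; the queues are the pipeline joints.
    stages = [
        asyncio.create_task(stt_stage(audio, text)),
        asyncio.create_task(llm_stage(text, tokens)),
        asyncio.create_task(tts_stage(tokens, played)),
    ]
    for chunk in chunks:
        await audio.put(chunk)
    await audio.put(None)
    await asyncio.gather(*stages)
    return played

print(asyncio.run(run_pipeline(["hi", "book"])))
```

The same shape holds with real clients: swap the stage bodies for websocket reads and streamed API responses, and the queues stay as the backpressure boundaries.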

The prompt structure

Voice prompts are very different from chat prompts. You can't use markdown, bullet points, or long responses. Everything has to be conversational, short, and voice-rhythmic. Here's the structure we use:

Voice agent system prompt structure:
You are [AGENT_NAME], the AI receptionist for [BUSINESS].

PERSONALITY:
- Warm, professional, conversational
- Use contractions ("I'll", "that's", "we're")
- Short sentences. Voice rhythm matters.
- Never read lists out loud. Pick 2-3 options max.

WHAT YOU CAN DO:
- Answer questions about hours, services, pricing
- Check calendar availability and book appointments
- Transfer to a human for anything else

WHAT YOU MUST NOT DO:
- Never say "as an AI"
- Never say "I don't have access to that information"
  (use the tools instead)
- Never read URLs or email addresses out loud
  (send them via SMS after the call)

CONVERSATION FLOW:
1. Greet caller warmly, ask how you can help
2. Listen for intent (booking / question / other)
3. Use tools to take action (check_calendar, book_appointment, etc)
4. Confirm action with caller
5. Offer anything else before ending

TOOLS AVAILABLE:
- check_calendar_availability(date, service)
- book_appointment(name, phone, date, time, service)
- send_sms(phone, message)
- transfer_to_human(reason)

CRITICAL RULES:
- If caller seems frustrated, transfer to human immediately
- If caller asks for pricing you don't know, use the FAQ tool
- If call has been >5 minutes without resolution, offer transfer
- Never make up information. Use tools or say you'll get back.

Three things to notice: explicit 'what you can't do' constraints (LLMs need this), tools defined as the agent's hands (this is the biggest unlock), and conversational rhythm guidance ('short sentences. voice rhythm matters.') which genuinely affects how the model responds.
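The "TOOLS AVAILABLE" list in the prompt corresponds to tool definitions passed to the model. Here's a sketch of two of them in the Anthropic Messages API's tool format, plus a local dispatcher for the resulting tool_use calls. The schemas and handlers are illustrative stubs, not our production integrations; real handlers would hit the calendar, CRM, and SMS APIs.

```python
# Illustrative tool definitions in Anthropic tool-use format (sketch only).
TOOLS = [
    {
        "name": "check_calendar_availability",
        "description": "Return open appointment slots for a date and service.",
        "input_schema": {
            "type": "object",
            "properties": {
                "date": {"type": "string", "description": "YYYY-MM-DD"},
                "service": {"type": "string"},
            },
            "required": ["date", "service"],
        },
    },
    {
        "name": "transfer_to_human",
        "description": "Hand the call to a human with full context.",
        "input_schema": {
            "type": "object",
            "properties": {"reason": {"type": "string"}},
            "required": ["reason"],
        },
    },
]

def handle_tool_call(name: str, args: dict) -> dict:
    # Dispatch a tool_use block from the model to the matching backend handler.
    # Stub handlers: real ones call the calendar/CRM/telephony APIs.
    handlers = {
        "check_calendar_availability": lambda a: {"slots": ["10:00", "14:30"]},
        "transfer_to_human": lambda a: {"transferred": True, "reason": a["reason"]},
    }
    if name not in handlers:
        return {"error": f"unknown tool: {name}"}
    return handlers[name](args)
```

The dispatcher's return value goes back to the model as a tool_result, which is what lets the agent confirm the booking in its next spoken turn.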

The build phases


Week 1: Telephony + basic conversation

Twilio number provisioned, incoming calls route to your orchestrator, STT → LLM → TTS pipeline working end-to-end. At this point the agent can talk to callers but can't DO anything.
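The "incoming calls route to your orchestrator" step comes down to the TwiML your webhook returns: Twilio's `<Connect><Stream>` verbs fork the call's raw audio into your server over a websocket (Media Streams). In production this would be a FastAPI route; the helper and URL below are placeholders for the sketch.

```python
# Sketch: TwiML returned by the Twilio Voice webhook to bridge call audio
# into the orchestrator's STT pipeline. The wss:// URL is a placeholder.
def incoming_call_twiml(stream_url: str) -> str:
    # <Connect><Stream> sends the caller's audio to our websocket endpoint
    # and plays back whatever audio we push on the same socket.
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response>"
        f'<Connect><Stream url="{stream_url}" /></Connect>'
        "</Response>"
    )

print(incoming_call_twiml("wss://example.invalid/media"))
```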


Week 2: Tool integration

Calendar API, CRM API, SMS sender, RAG FAQ store. The agent can now check availability, book appointments, look up information, send follow-up texts. This is when it starts being useful.


Week 3: Prompt tuning + edge cases

Test with 50+ real-world call scenarios. Handle accents, background noise, interruptions, confused callers, angry callers, callers asking about things outside scope. This week is where the quality jumps from 'demo' to 'production.'


Week 4: Human handoff + monitoring

Seamless transfer to a human when needed (with full context passed over). Dashboards showing call outcomes, average duration, satisfaction scores. Alerting when things break.


Week 5: Soft launch + iteration

Route 20% of calls to the agent. Record everything. Listen to 50+ conversations. Find the patterns the agent handles badly. Tune prompt. Expand to 100% once quality bar is met.

What tests we actually run before going live

  • 500 adversarial test calls with prompted edge cases (angry caller, non-English speaker, background noise, multiple questions at once)
  • Accent coverage test — at minimum American, British, Indian, and whatever regional accents your customer base has
  • Interruption handling — the caller talks over the agent mid-sentence, agent should gracefully stop and listen
  • Noise test — calls from highway, restaurant, airport. Agent should still transcribe correctly
  • Hand-off test — transfer to human 10 times across different scenarios, verify context is passed correctly
  • Calendar boundary test — try to book outside business hours, at the same time as another appointment, with invalid dates. Agent should handle gracefully.
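The calendar boundary test is the easiest of these to automate. A minimal sketch of the checks it exercises, with made-up business hours and a made-up booked-slot fixture (none of this is real clinic data):

```python
from datetime import datetime, time

# Illustrative fixtures: business hours and one already-booked slot.
OPEN, CLOSE = time(9, 0), time(17, 0)
BOOKED = {"2026-02-12T10:00"}

def can_book(slot_iso: str) -> bool:
    try:
        slot = datetime.fromisoformat(slot_iso)
    except ValueError:
        return False  # invalid date: the agent should re-ask, not crash
    if not (OPEN <= slot.time() < CLOSE):
        return False  # outside business hours
    return slot_iso not in BOOKED  # double-booking guard

assert can_book("2026-02-12T11:00")
assert not can_book("2026-02-12T19:30")  # after hours
assert not can_book("2026-02-12T10:00")  # slot already taken
assert not can_book("not-a-date")        # invalid date
```

The adversarial call suite then checks that the *agent* relays each rejection gracefully ("that slot's taken, I have 10:30 or 2pm open") instead of surfacing an error.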
Real NoFluff Case Study

Clinic chain: 0 missed calls, 31% more bookings in 60 days


FAQ

How human does it actually sound?

Way better than you think. With ElevenLabs v3 voices and proper prompt rhythm, we regularly run a test where we ask clients to guess whether a 60-second recording is AI or human. They get it right about 55% of the time — barely better than chance. The giveaways are usually tempo, not voice quality.

Build a voice agent that sounds human.

We build production voice agents for service businesses. Full stack, full ownership, sub-500ms latency. Typical build: 3-5 weeks. See a live demo on your free strategy call.

Hear a live demo