Voice AI · February 10, 2026 · 5 min read

The AI voice agent build playbook: ElevenLabs + Twilio + Claude, step by step

Everything we've learned deploying AI voice agents for clinics, home services, and law firms — architecture, prompts, latency tricks, and the exact stack that makes it sound human.

Gavish Goyal
Founder, NoFluff Pro

The technology that makes a voice AI sound human did not exist 18 months ago. The gap between 'press 1 for sales' hell and natural voice conversation closed faster than anyone predicted. Here's the exact stack and build playbook we use.

When people think 'voice AI,' most still picture 2015-era interactive voice response (IVR) systems: menu trees, robotic voices, and customers mashing '0' to reach a human. That's not what modern voice AI is. Modern voice AI is a conversation that feels 90% human — because the components that made it robotic are finally fixed.

We've deployed voice agents for a clinic chain, a home services company, and a law firm in the last year. Here's what actually works, what doesn't, and how you'd build it yourself if you wanted to.

The stack

| Layer | What we use | Why |
| --- | --- | --- |
| Telephony | Twilio Voice | Mature, reliable, global number coverage, generous programmatic API |
| Speech-to-text | OpenAI Whisper / Deepgram Nova-2 | Sub-300ms latency, handles accents, streams partial results |
| LLM brain | Claude Sonnet 4.5 | Best tool use, best multi-turn conversation coherence in 2026 |
| Text-to-speech | ElevenLabs v3 Turbo | Human-indistinguishable voices, 250ms TTFT, emotion control |
| Orchestration | Custom Python on FastAPI + Redis | We need sub-500ms total latency; off-the-shelf orchestrators are too slow |
| Tools (functions) | Calendar API, CRM API, RAG store, SMS sender | The agent's ability to DO things, not just talk |

The latency obsession

Here's the single most important thing about building voice agents that don't feel robotic: latency is everything. If there's more than 500ms between when the caller stops speaking and when the agent starts responding, it feels wrong. Under 500ms, it feels natural. Under 300ms, it feels eerily human.

This is why most off-the-shelf voice AI platforms feel bad. They chain naive sequential steps (STT → LLM → TTS) and the total latency balloons to 2-3 seconds. Every conversation has awkward pauses. Callers hate it.
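The math behind that balloon is simple enough to sketch. The numbers below are illustrative assumptions, not measurements: sequential chaining pays the full duration of every stage before the caller hears anything, while pipelining only pays each stage's startup cost.

```python
# Illustrative latency budget. All numbers are assumptions for the sketch,
# not benchmarks of any specific vendor.

# Sequential: each stage waits for the previous one to finish completely.
SEQUENTIAL_MS = {
    "stt_wait_for_full_transcript": 600,
    "llm_full_generation_start_to_done": 800,
    "tts_full_synthesis": 800,
}
sequential_total = sum(SEQUENTIAL_MS.values())  # silence the caller sits through

# Pipelined: each stage starts on partial output of the previous one, so the
# perceived gap is roughly the sum of per-stage *startup* costs only.
PIPELINED_MS = {
    "vad_plus_partial_transcript": 80,
    "llm_time_to_first_token": 150,
    "tts_time_to_first_byte": 250,
}
pipelined_total = sum(PIPELINED_MS.values())

print(sequential_total)  # 2200 — the "2-3 seconds" zone
print(pipelined_total)   # 480 — under the 500ms bar
```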

< 500ms: the latency target that separates 'feels human' from 'feels robotic'

How we hit 500ms

  1. Streaming everywhere. STT streams partial transcripts to the LLM before the caller finishes speaking. LLM streams tokens to TTS before it finishes generating. TTS streams audio to Twilio as it's synthesized. All three layers are pipelined.
  2. Voice activity detection (VAD). Don't wait for silence — detect end-of-speech aggressively. Saves 200-400ms per turn.
  3. Optimistic tool calls. When the LLM decides it needs to check the calendar, start the API call immediately while the LLM is still generating the acknowledgment sentence.
  4. Pre-warmed TTS connection. ElevenLabs has a websocket mode — we keep it open so first-byte-to-speech is 250ms, not 800ms.
  5. Regional deployment. Host the orchestrator in the same AWS region as Twilio's SIP endpoints. Shaves 50-100ms of network round trips.
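The "streaming everywhere" idea (point 1) can be sketched with asyncio queues: three stages connected by queues, each consuming partial output from the one before it so all three run concurrently. The stage bodies below are stand-ins for the real Deepgram/Claude/ElevenLabs streaming clients; every name here is a placeholder, not our production orchestrator.

```python
import asyncio

async def stt_stage(audio_in: asyncio.Queue, text_out: asyncio.Queue):
    # Push partial transcripts downstream before the caller finishes speaking.
    while (chunk := await audio_in.get()) is not None:
        await text_out.put(f"heard {chunk}")
    await text_out.put(None)  # end-of-speech sentinel

async def llm_stage(text_in: asyncio.Queue, token_out: asyncio.Queue):
    # Start generating on partial transcripts; stream tokens downstream.
    while (partial := await text_in.get()) is not None:
        for token in partial.split():
            await token_out.put(token)
    await token_out.put(None)

async def tts_stage(token_in: asyncio.Queue, played: list):
    # Synthesize and "play" audio per token as it arrives.
    while (token := await token_in.get()) is not None:
        played.append(f"audio<{token}>")

async def run_pipeline(chunks: list[str]) -> list[str]:
    audio, text, tokens = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    played: list[str] = []
    # All three stages run at once; the queues are the pipeline joints.
    stages = [
        asyncio.create_task(stt_stage(audio, text)),
        asyncio.create_task(llm_stage(text, tokens)),
        asyncio.create_task(tts_stage(tokens, played)),
    ]
    for chunk in chunks:
        await audio.put(chunk)
    await audio.put(None)
    await asyncio.gather(*stages)
    return played

print(asyncio.run(run_pipeline(["hi", "book"])))
```

The same shape holds with real clients: swap the stage bodies for websocket reads and streamed API responses, and the queues stay as the backpressure boundaries.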

The prompt structure

Voice prompts are very different from chat prompts. You can't use markdown, bullet points, or long responses. Everything has to be conversational, short, and voice-rhythmic. Here's the structure we use:

Voice agent system prompt structure:
You are [AGENT_NAME], the AI receptionist for [BUSINESS].

PERSONALITY:
- Warm, professional, conversational
- Use contractions ("I'll", "that's", "we're")
- Short sentences. Voice rhythm matters.
- Never read lists out loud. Pick 2-3 options max.

WHAT YOU CAN DO:
- Answer questions about hours, services, pricing
- Check calendar availability and book appointments
- Transfer to a human for anything else

WHAT YOU MUST NOT DO:
- Never say "as an AI"
- Never say "I don't have access to that information"
  (use the tools instead)
- Never read URLs or email addresses out loud
  (send them via SMS after the call)

CONVERSATION FLOW:
1. Greet caller warmly, ask how you can help
2. Listen for intent (booking / question / other)
3. Use tools to take action (check_calendar, book_appointment, etc)
4. Confirm action with caller
5. Offer anything else before ending

TOOLS AVAILABLE:
- check_calendar_availability(date, service)
- book_appointment(name, phone, date, time, service)
- send_sms(phone, message)
- transfer_to_human(reason)

CRITICAL RULES:
- If caller seems frustrated, transfer to human immediately
- If caller asks for pricing you don't know, use the FAQ tool
- If call has been >5 minutes without resolution, offer transfer
- Never make up information. Use tools or say you'll get back.

Three things to notice: explicit 'what you can't do' constraints (LLMs need this), tools defined as the agent's hands (this is the biggest unlock), and conversational rhythm guidance ('short sentences. voice rhythm matters.') which genuinely affects how the model responds.
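The "TOOLS AVAILABLE" list in the prompt corresponds to tool definitions passed to the model. Here's a sketch of two of them in the Anthropic Messages API's tool format, plus a local dispatcher for the resulting tool_use calls. The schemas and handlers are illustrative stubs, not our production integrations; real handlers would hit the calendar, CRM, and SMS APIs.

```python
# Illustrative tool definitions in Anthropic tool-use format (sketch only).
TOOLS = [
    {
        "name": "check_calendar_availability",
        "description": "Return open appointment slots for a date and service.",
        "input_schema": {
            "type": "object",
            "properties": {
                "date": {"type": "string", "description": "YYYY-MM-DD"},
                "service": {"type": "string"},
            },
            "required": ["date", "service"],
        },
    },
    {
        "name": "transfer_to_human",
        "description": "Hand the call to a human with full context.",
        "input_schema": {
            "type": "object",
            "properties": {"reason": {"type": "string"}},
            "required": ["reason"],
        },
    },
]

def handle_tool_call(name: str, args: dict) -> dict:
    # Dispatch a tool_use block from the model to the matching backend handler.
    # Stub handlers: real ones call the calendar/CRM/telephony APIs.
    handlers = {
        "check_calendar_availability": lambda a: {"slots": ["10:00", "14:30"]},
        "transfer_to_human": lambda a: {"transferred": True, "reason": a["reason"]},
    }
    if name not in handlers:
        return {"error": f"unknown tool: {name}"}
    return handlers[name](args)
```

The dispatcher's return value goes back to the model as a tool_result, which is what lets the agent confirm the booking in its next spoken turn.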

The build phases


Week 1: Telephony + basic conversation

Twilio number provisioned, incoming calls route to your orchestrator, STT → LLM → TTS pipeline working end-to-end. At this point the agent can talk to callers but can't DO anything.
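The "incoming calls route to your orchestrator" step comes down to the TwiML your webhook returns: Twilio's `<Connect><Stream>` verbs fork the call's raw audio into your server over a websocket (Media Streams). In production this would be a FastAPI route; the helper and URL below are placeholders for the sketch.

```python
# Sketch: TwiML returned by the Twilio Voice webhook to bridge call audio
# into the orchestrator's STT pipeline. The wss:// URL is a placeholder.
def incoming_call_twiml(stream_url: str) -> str:
    # <Connect><Stream> sends the caller's audio to our websocket endpoint
    # and plays back whatever audio we push on the same socket.
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response>"
        f'<Connect><Stream url="{stream_url}" /></Connect>'
        "</Response>"
    )

print(incoming_call_twiml("wss://example.invalid/media"))
```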


Week 2: Tool integration

Calendar API, CRM API, SMS sender, RAG FAQ store. The agent can now check availability, book appointments, look up information, send follow-up texts. This is when it starts being useful.


Week 3: Prompt tuning + edge cases

Test with 50+ real-world call scenarios. Handle accents, background noise, interruptions, confused callers, angry callers, callers asking about things outside scope. This week is where the quality jumps from 'demo' to 'production.'


Week 4: Human handoff + monitoring

Seamless transfer to a human when needed (with full context passed over). Dashboards showing call outcomes, average duration, satisfaction scores. Alerting when things break.


Week 5: Soft launch + iteration

Route 20% of calls to the agent. Record everything. Listen to 50+ conversations. Find the patterns the agent handles badly. Tune prompt. Expand to 100% once quality bar is met.

What tests we actually run before going live

  • 500 adversarial test calls with prompted edge cases (angry caller, non-English speaker, background noise, multiple questions at once)
  • Accent coverage test — at minimum American, British, Indian, and whatever regional accents your customer base has
  • Interruption handling — the caller talks over the agent mid-sentence, agent should gracefully stop and listen
  • Noise test — calls from highway, restaurant, airport. Agent should still transcribe correctly
  • Hand-off test — transfer to human 10 times across different scenarios, verify context is passed correctly
  • Calendar boundary test — try to book outside business hours, at the same time as another appointment, with invalid dates. Agent should handle gracefully.
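The calendar boundary test is the easiest of these to automate. A minimal sketch of the checks it exercises, with made-up business hours and a made-up booked-slot fixture (none of this is real clinic data):

```python
from datetime import datetime, time

# Illustrative fixtures: business hours and one already-booked slot.
OPEN, CLOSE = time(9, 0), time(17, 0)
BOOKED = {"2026-02-12T10:00"}

def can_book(slot_iso: str) -> bool:
    try:
        slot = datetime.fromisoformat(slot_iso)
    except ValueError:
        return False  # invalid date: the agent should re-ask, not crash
    if not (OPEN <= slot.time() < CLOSE):
        return False  # outside business hours
    return slot_iso not in BOOKED  # double-booking guard

assert can_book("2026-02-12T11:00")
assert not can_book("2026-02-12T19:30")  # after hours
assert not can_book("2026-02-12T10:00")  # slot already taken
assert not can_book("not-a-date")        # invalid date
```

The adversarial call suite then checks that the *agent* relays each rejection gracefully ("that slot's taken, I have 10:30 or 2pm open") instead of surfacing an error.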
Real NoFluff Case Study

Clinic chain: 0 missed calls, 31% more bookings in 60 days


FAQ

How human does it actually sound?

Way better than you think. With ElevenLabs v3 voices and proper prompt rhythm, we regularly run a test where we ask clients to guess whether a 60-second recording is AI or human. They get it right about 55% of the time — barely better than chance. The giveaways are usually tempo, not voice quality.

Build a voice agent that sounds human.

We build production voice agents for service businesses. Full stack, full ownership, sub-500ms latency. Typical build: 3-5 weeks. See a live demo on your free strategy call.

Hear a live demo