Voice AI Agents: Architecture, Deployment & Evaluation Best Practices

Henry Finkelstein, Founding Growth Engineer

April 27, 2026 · 14 min read

A voice AI agent is software that answers a phone call, holds a goal-directed conversation with the caller, and updates downstream systems with whatever happened. In 2026 it is production infrastructure: teams in healthcare, financial services, restaurants, government, and dozens of other verticals run voice agents that handle thousands of calls per day, take real money, dispatch real services, and produce real liability when they get it wrong. The teams shipping them well have stopped treating “build a voice agent” as the goal and started treating “build the voice AI evaluation infrastructure that makes a voice agent shippable” as the actual unit of work. That worldview is what Coval was founded on.

This guide is for technical buyers and engineering leaders making decisions about voice AI agents in production. It covers the architecture choices that matter, the deployment patterns that work, and the evaluation requirements that separate shippable agents from expensive prototypes.

Key takeaways

Five-layer stack: Telephony, real-time transport, model stack, orchestration, and evaluation/observability. Weak links at the seams cause most production failures.
Architecture decision: Cascaded pipelines still dominate enterprise voice AI deployments, but speech-to-speech is catching up. Cascaded wins on observability because each stage produces traces that pinpoint where a conversation went wrong. Speech-to-speech wins on latency and prosody and (so far) leaves the model’s reasoning opaque, though the most recent S2S releases are starting to expose deeper reasoning observability.
Platform choice: Use Vapi, Retell, LiveKit, or Pipecat unless volume, customization, or compliance forces a custom build.
Coverage gap: Agents that pass 95% of test calls in dev typically handle only 60-70% of production calls. The fix is broader evaluation coverage rather than a better model.
Evaluation infrastructure is the moat: Pre-production simulation, production observability, and a feedback loop between them turn the agent into a system that improves over time.

What a voice AI agent does in 2026

A voice AI agent receives a phone call, holds a goal-directed conversation with the caller, completes one or more tasks during the call (lookups, transactions, scheduling, escalation), and updates downstream systems with whatever happened. The deployment surfaces vary: inbound customer service, outbound reminders and sales, drive-through ordering, internal IT helpdesks, healthcare scheduling, insurance intake. The underlying job is the same in each case: replace or augment a human agent on a phone call, at scale, reliably enough that the business is willing to bet on it.

The category has matured fast. Two years ago, most voice agent deployments were demos and pilots. In 2026, the leading platforms handle production traffic for major brands, run inside regulated industries with audit requirements, and integrate deeply with internal systems of record. The architecture and deployment patterns are converging.

The reference architecture

A modern voice AI agent stack has five layers. Teams that ship well make intentional decisions at each layer; teams that struggle usually have a weak link somewhere.

1. Telephony

The layer that takes a phone call off the public network and gets the audio into your stack. Twilio, Telnyx, Plivo, and managed offerings from voice agent platforms are the common choices. The decision is mostly about commercial terms, latency, and which providers integrate with your existing carrier relationships.

2. Real-time transport

The audio plumbing inside your stack. LiveKit has become the dominant choice for teams building on speech-to-speech architectures. Daily, Agora, and platform-managed transports are common for cascaded pipelines. Latency budgets here matter: every 50 milliseconds at the transport layer is 50 milliseconds the caller waits.

3. The model stack

In a cascaded pipeline: a speech-to-text model, a language model, and a text-to-speech model. In a speech-to-speech architecture: a single model handling audio in and audio out. We covered the trade-offs in speech-to-speech vs. cascaded voice AI. For most enterprise deployments, cascaded is still the default in 2026 because of the observability advantage.

4. Orchestration

The layer that decides what the agent does: which tools to call, when to escalate, when to ask clarifying questions, how to handle errors. Vapi, Retell, Pipecat, and custom orchestrators on LiveKit are the common patterns. The orchestration layer is often where most of the business logic lives.

5. Evaluation and observability

The layer that tells you whether the agent is working. Pre-production simulation that exercises the agent against test scenarios, production observability that grades every live conversation, and the feedback loop between them. This layer is where most teams underinvest, and it’s the layer that determines whether the agent survives contact with real callers.

The five layers form a stack, but the interesting failure modes happen at the seams between them. Tool calls go to the orchestration layer; if the language model picks the wrong tool or passes the wrong parameter, the call fails downstream. Audio events go up the stack; if the transport drops a packet at the wrong moment, the STT misses a critical word. Each seam is a category of bug, and each category needs its own evaluation strategy.
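One way to guard the seam between the language model and the orchestration layer is to validate every proposed tool call before it runs. The sketch below is a minimal illustration only: the tool names, parameter schemas, and the `validate_tool_call` helper are hypothetical and not tied to any platform's API.

```python
from typing import Any

# Hypothetical tool registry: tool name -> required parameters and their types.
TOOL_SCHEMAS: dict[str, dict[str, type]] = {
    "lookup_appointment": {"patient_id": str},
    "book_appointment": {"patient_id": str, "slot_iso": str},
}


def validate_tool_call(name: str, args: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the call is safe to run."""
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        return [f"unknown tool: {name}"]
    problems = []
    # Every required parameter must be present and of the expected type.
    for param, expected in schema.items():
        if param not in args:
            problems.append(f"missing parameter: {param}")
        elif not isinstance(args[param], expected):
            problems.append(f"wrong type for {param}: expected {expected.__name__}")
    # Parameters the tool does not declare are flagged rather than silently dropped.
    for param in args:
        if param not in schema:
            problems.append(f"unexpected parameter: {param}")
    return problems
```

Logging the returned problems alongside the conversation trace makes wrong-tool and wrong-parameter failures countable as their own category, instead of showing up only as downstream call failures.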

Cascaded vs. speech-to-speech: the architectural decision

The single most consequential architecture decision for a new voice AI agent is whether to use a cascaded pipeline or a speech-to-speech model. The choice cascades into every subsequent decision.

Cascaded pipelines give you three separate models you can swap, evaluate, and observe independently. Each has its own metric (word error rate for STT, intent accuracy for the LLM, naturalness for TTS), its own evaluation, and its own optimization path. Most regulated deployments use cascaded because the audit trail is cleaner: transcripts you can review, reasoning traces you can inspect, tool calls you can replay.
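As an example of a per-stage metric, word error rate for the STT stage is the word-level edit distance between a reference transcript and the model's transcript, divided by the number of reference words. The sketch below uses the standard definition with deliberately simple normalization (lowercasing and whitespace splitting); production scoring usually also normalizes punctuation and numerals, which is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words gives a WER of 0.2.
example = word_error_rate(
    "refill my prescription for tuesday",
    "refill my subscription for tuesday",
)
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, so it should not be treated as a percentage bounded at 100.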

Speech-to-speech collapses the three models into one. Latency drops, sometimes dramatically. Prosody improves because the model is generating audio natively. The model can hear and respond to paralinguistic signals (tone, emotion) that text-based pipelines lose. The cost is observability: when the whole conversation happens inside one model, you can’t grade transcription quality independently because there’s no transcript.

The pattern across enterprise deployments in 2026: cascaded for high-stakes, regulated, or compliance-sensitive use cases (healthcare scheduling, insurance intake, financial services, government). Speech-to-speech for latency-critical or experience-critical use cases (drive-through ordering, premium customer service tiers) where the experience improvement justifies the observability cost.

Some teams run hybrid stacks (cascaded for the bulk of the agent, speech-to-speech for specific high-touch interactions). The complexity is real, but for some use cases the trade-off is worth it.

Choosing models and components

The architecture decision sets the shape of the stack; the model choices inside each layer are the next decision and feed into platform selection.

LLMs. GPT-4o, Claude Sonnet 4.6, and Gemini Flash are the most common cascaded-pipeline choices in 2026. For speech-to-speech, OpenAI’s GPT-Realtime-2 and Google’s Gemini 3.1 Flash Live are the production options. Claude has the edge on multi-turn instruction following. GPT-4o has the strongest tool-calling ecosystem. Gemini Flash wins on cost. Empirical comparison on your own data is the right way to choose; we covered the trade-offs in voice AI models in 2026.
STT. Deepgram Nova-3 is the most common commercial choice (~150ms streaming latency, real-time diarization). Whisper for non-English or harder audio. See best speech-to-text providers in 2026.
TTS. ElevenLabs Eleven v3 and Cartesia Sonic 3 are the premium options. The TTS choice often has more impact on caller perception than the LLM choice. Detailed comparison in ElevenLabs vs. Cartesia.

The pattern that works: premium components in the layers the caller perceives directly (TTS, the LLM driving the conversation), cost-optimized components in the layers they don’t (STT, fallback LLMs for non-critical paths), and evaluation infrastructure across the whole stack so you can verify each decision empirically.

Choosing your platform

The platform layer is where most teams make their first irreversible decision. A few patterns:

Vapi. Developer-first, deeply customizable, supports any STT/LLM/TTS combination. Popular with technical teams who want fine-grained control. We covered the product in our Vapi review for 2026 and the partnership story in Vapi and Coval.
Retell AI. Higher-level platform optimized for call center deployments. Faster time-to-production than Vapi for vertical use cases, less flexibility on the underlying stack.
LiveKit. Open-source transport layer that’s become the default for speech-to-speech architectures. Often combined with Pipecat or a custom orchestrator.
Pipecat. Open-source orchestration framework. Strong community, good support for custom architectures, requires more engineering investment than the managed platforms.
ElevenLabs Conversational. Conversational AI product built on ElevenLabs’ TTS. Strong on voice quality, less flexibility on the LLM and orchestration layers.
Sierra, Decagon, Replicant, Cresta, Parloa. Higher-level customer service AI platforms with opinionated workflows and pre-built integrations. Faster to deploy if your use case fits their vertical; harder to customize off the happy path. Parloa has particularly strong traction in European enterprise contact centers.
Custom builds. Some teams (typically those at the larger end of voice AI deployments, or those with narrow requirements that no platform fits) build their own stack on top of open-source components. This is a meaningful engineering investment but produces the most control and the lowest unit cost at scale.

The platform comparison decision usually comes down to four questions: how much customization do you need, how much engineering capacity do you have, what’s your timeline to production, and what compliance posture do you need. We compared the leading platforms with performance data in voice AI platform comparison 2026.

Deployment patterns that work

Across hundreds of voice AI deployments, a few patterns have proven out. Teams that follow them tend to ship faster and survive production better than teams that don’t.

Start with a narrow, high-volume use case

The fastest path to ROI is a voice agent that handles one type of call well, at high volume. Appointment confirmation, password reset, order status, basic FAQ. These are the use cases where voice AI shines. Trying to build a general-purpose voice agent that handles the long tail of every possible call type is the most common reason ambitious deployments stall.

Expand scope once the narrow use case is reliably handling production traffic. Adding capability to a working agent is much easier than building a comprehensive agent from scratch.

Build for failure handling from day one

Most production voice AI failures happen at the integration layer rather than the model: the API returns an unexpected format, the call drops mid-transaction, the tool times out, the memory lookup returns stale data. Agents that don’t have explicit handling for these conditions fail loudly when they hit production.

The pattern: every tool call has a timeout, a retry policy, a fallback behavior, and a clear escalation path to a human agent when the agent can’t complete the goal. Treat these as core agent behavior and build them in from day one rather than deferring them to pre-launch hardening.
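The timeout, retry, fallback, and escalation pattern can be sketched as a single wrapper around each tool call. This is a minimal illustration using Python's `asyncio`; the `ToolResult` type, the `call_tool_safely` helper, and the default values are assumptions chosen for the example, not any platform's API.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Optional


@dataclass
class ToolResult:
    ok: bool
    value: Any = None
    escalate: bool = False  # True means hand the call to a human agent


async def call_tool_safely(
    tool: Callable[[], Awaitable[Any]],
    *,
    timeout_s: float = 2.0,
    max_attempts: int = 3,
    backoff_s: float = 0.2,
    fallback: Optional[Callable[[], Any]] = None,
) -> ToolResult:
    """Run a tool with a timeout and retries, then fall back or escalate."""
    for attempt in range(1, max_attempts + 1):
        try:
            value = await asyncio.wait_for(tool(), timeout=timeout_s)
            return ToolResult(ok=True, value=value)
        except Exception:
            # Covers timeouts and tool errors; back off before the next attempt.
            if attempt < max_attempts:
                await asyncio.sleep(backoff_s * attempt)
    if fallback is not None:
        # Degraded answer (for example cached data) instead of a hard failure.
        return ToolResult(ok=False, value=fallback())
    return ToolResult(ok=False, escalate=True)
```

Returning an explicit `escalate` flag, rather than raising, keeps the handoff decision inside the agent's normal control flow, so the conversation can continue with a transfer message instead of dead air.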

Treat the human agent as a first-class part of the system

The teams shipping voice AI at scale have redesigned the human role around the voice agent: humans now handle the cases where the voice agent shouldn’t or can’t complete the goal, while the agent takes the high-volume repetitive work. This means investing in handoff quality (context summarization, warm transfer, intent routing) at least as much as agent quality.

A voice agent that escalates well is more valuable than a voice agent that tries to handle everything and occasionally fails badly. The escalation path is part of the product.

Instrument before you ship

Production observability is hard to retrofit. Teams that wait until after launch to think about logging, metrics, and quality grading end up with months of production traffic they can’t analyze. Instrument the agent during development with the same observability you’ll want at scale: full conversation recordings, tool-call traces, latency metrics, grading metadata. We covered this in voice AI evaluation infrastructure.
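A minimal sketch of what instrumenting during development can look like: one structured record per conversation turn, written as a JSON line. The `TurnTrace` type and its field names are hypothetical; the point is that tool-call traces, latency metrics, and grading metadata are captured in one queryable record from the first test call onward.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class TurnTrace:
    call_id: str
    turn: int
    transcript: str
    tool_calls: list[dict] = field(default_factory=list)      # name, args, outcome
    latency_ms: dict[str, float] = field(default_factory=dict)  # per-stage timings
    grades: dict[str, float] = field(default_factory=dict)      # filled in by graders
    recorded_at: float = field(default_factory=time.time)

    def to_json_line(self) -> str:
        """Serialize as one JSON line, suitable for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)


trace = TurnTrace(
    call_id="call-001",
    turn=1,
    transcript="I need to move my appointment",
    latency_ms={"stt": 150.0},
)
line = trace.to_json_line()
```

Audio recordings are assumed to be stored separately and referenced by `call_id`; they are too large to embed in a per-turn log record.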

Run a regression suite on every change

A prompt tweak that improves Behavior A often regresses Behavior B. Without automated regression testing on every release, teams enter a whack-a-mole pattern that frustrates everyone involved. A scenario library of 100 to 500 test conversations, run automatically before any change ships, catches the regressions early. The teams that don’t do this ship slowly, fearfully, and with a lot of unplanned weekends fixing production issues. See the three-layer testing framework for voice AI for the deeper methodology.
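The core of a regression gate is a per-scenario comparison between the last released version and the candidate. The sketch below assumes each run produces a pass rate between 0 and 1 per scenario ID; the `find_regressions` helper, the scenario names, and the 0.02 tolerance are illustrative assumptions.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return scenarios whose score dropped by more than `tolerance`.

    Scores are pass rates in [0, 1] keyed by scenario ID. A scenario missing
    from the candidate run counts as 0.0 so it cannot pass silently.
    """
    regressions = {}
    for scenario_id, base_score in baseline.items():
        drop = base_score - candidate.get(scenario_id, 0.0)
        if drop > tolerance:
            regressions[scenario_id] = round(drop, 4)
    return regressions


baseline_run = {"reschedule": 0.95, "refund": 0.90, "angry_caller": 0.80}
candidate_run = {"reschedule": 0.97, "refund": 0.70}
blocked = find_regressions(baseline_run, candidate_run)
# "refund" dropped by 0.2 and "angry_caller" was not run at all, so both block the release.
```

Comparing per scenario rather than on the suite-wide average is the design choice that matters here: an average can stay flat while Behavior A improves and Behavior B regresses.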

Evaluation: the part most teams underestimate

More than any other part of the stack, evaluation decides whether a voice AI agent reaches production successfully.

A voice agent that handles 95 percent of test calls in development typically handles 60 to 70 percent of real calls once it meets production conditions. The gap comes from coverage rather than model quality. The test set didn’t include the audio quality, accents, edge cases, integration failures, or adversarial behaviors that real callers produce.
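The arithmetic behind the gap is a weighted average over the conditions real callers produce. In the sketch below, every share and pass rate is an invented number chosen only to illustrate the mechanism: a test set drawn entirely from the first slice would report 95 percent, while the blended production figure lands far lower.

```python
# Hypothetical production traffic mix: slice -> (share of calls, agent pass rate).
# All values are illustrative assumptions, not measurements.
PRODUCTION_SLICES = {
    "clean audio, common intent": (0.40, 0.95),
    "noisy or low-quality audio": (0.25, 0.60),
    "accent variation": (0.15, 0.55),
    "integration failure": (0.10, 0.30),
    "adversarial or off-script": (0.10, 0.25),
}


def expected_pass_rate(slices: dict[str, tuple[float, float]]) -> float:
    """Blend per-slice pass rates, weighted by each slice's share of traffic."""
    total_share = sum(share for share, _ in slices.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("slice shares must sum to 1.0")
    return sum(share * rate for share, rate in slices.values())


blended = expected_pass_rate(PRODUCTION_SLICES)
# 0.38 + 0.15 + 0.0825 + 0.03 + 0.025 = 0.6675, roughly 67 percent.
```

The model is identical in both numbers; only the coverage of the test set differs, which is why adding scenarios for the weak slices moves the production figure more than swapping models does.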

The teams that close this gap have built evaluation infrastructure across three dimensions.

Pre-production simulation

Before any agent change ships, it runs against a library of test scenarios. The scenarios cover the agent’s expected behavior: happy paths, edge cases, audio variation, accent variation, adversarial callers, integration failures, tool-call edge cases. The implementation primitive that holds this together is simulations: scripted personas and conditions that exercise the agent like a real caller would. The simulation produces graded conversations that show whether the change improved or regressed the agent’s behavior across the suite.

The scenario library starts small (50 to 100 scenarios for an early-stage agent) and grows over time. Every production incident becomes a new scenario. Every customer complaint becomes a new scenario. Every clever attack a tester finds becomes a new scenario. The library is the institutional memory of every failure mode the team has ever encountered, codified into automated tests.
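A scenario library can start as something very small. The sketch below is one possible shape, with hypothetical field names: each scenario records a persona, a goal, the conditions to simulate, and where the scenario came from, so the share of the suite that traces back to real incidents stays visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    scenario_id: str
    persona: str                      # who the simulated caller is
    goal: str                         # what the caller is trying to do
    conditions: tuple[str, ...] = ()  # e.g. background noise, injected tool failure
    source: str = "handwritten"       # or "incident", "complaint", "red-team"


class ScenarioLibrary:
    def __init__(self) -> None:
        self._scenarios: dict[str, Scenario] = {}

    def add(self, scenario: Scenario) -> bool:
        """Add a scenario; return False if the ID is already present."""
        if scenario.scenario_id in self._scenarios:
            return False
        self._scenarios[scenario.scenario_id] = scenario
        return True

    def by_source(self, source: str) -> list[Scenario]:
        """All scenarios that originated from the given source."""
        return [s for s in self._scenarios.values() if s.source == source]

    def __len__(self) -> int:
        return len(self._scenarios)


library = ScenarioLibrary()
library.add(Scenario("reschedule-happy", "calm patient", "move appointment to Friday"))
library.add(Scenario(
    "reschedule-timeout",
    "impatient patient",
    "move appointment to Friday",
    conditions=("road noise", "calendar API timeout"),
    source="incident",
))
```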

Production observability

Once the agent is live, every conversation gets graded against the same criteria and evaluation metrics the pre-production simulations use: resolution rate, escalation accuracy, tone, compliance adherence, and tool-call success. The output is a dashboard that surfaces drift before it becomes a customer-affecting problem. Conversations that fall below threshold automatically route to human review so a person can confirm or correct the grade, and the calibrated examples flow back into the eval suite.
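The threshold routing step can be sketched in a few lines. The metric keys mirror the criteria named above, but the threshold values, the 0-to-1 score scale, and the `route` helper are assumptions for illustration; how each score is produced (automated grader, rubric, rule check) is out of scope here.

```python
# Hypothetical per-metric minimums; scores are assumed to lie in [0, 1].
THRESHOLDS = {
    "resolution": 0.8,
    "escalation_accuracy": 0.9,
    "tone": 0.7,
    "compliance": 1.0,
    "tool_call_success": 0.9,
}


def failing_metrics(
    scores: dict[str, float],
    thresholds: dict[str, float] = THRESHOLDS,
) -> list[str]:
    """Metrics below threshold; a metric with no score counts as failing."""
    return sorted(
        metric
        for metric, minimum in thresholds.items()
        if scores.get(metric, 0.0) < minimum
    )


def route(scores: dict[str, float]) -> str:
    """Send any conversation with a failing metric to human review."""
    return "human_review" if failing_metrics(scores) else "auto_accept"
```

Treating a missing score as a failure is a deliberate choice in this sketch: a grader that silently stops emitting a metric then shows up as a spike in review volume rather than as a quiet gap in the dashboard.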

Drift happens. Model vendors update their models without changing version strings. Backend APIs change their response schemas. Caller demographics shift with marketing campaigns. The agent that worked last month doesn’t necessarily work this month, and production observability is how you find out without waiting for the complaints. We covered this in what is voice AI observability.
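A very simple drift check compares a recent window of a graded metric against a baseline window. The sketch below assumes daily resolution rates and an arbitrary 0.05 tolerance; it is a plain comparison of means, not a statistical test, so a real deployment would likely add a minimum sample size and a significance check.

```python
from statistics import mean


def detect_drift(
    baseline: list[float],
    recent: list[float],
    max_drop: float = 0.05,
) -> bool:
    """True when the recent mean has fallen more than `max_drop` below baseline."""
    if not baseline or not recent:
        raise ValueError("both windows need at least one data point")
    return mean(baseline) - mean(recent) > max_drop


# Illustrative daily resolution rates (invented values).
last_month = [0.90, 0.92, 0.91]
this_week = [0.84, 0.83, 0.85]
drifted = detect_drift(last_month, this_week)  # drop of about 0.07, so drift is flagged
```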

The feedback loop

The most underrated part of the system. Failures caught in production get reproduced as scenarios in simulation, so they don’t recur. Patterns from real call data get used to generate new test scenarios. The agent improves over time because the evaluation infrastructure is collecting evidence and feeding it back into the development cycle.

One methodology that’s proven to work at scale comes from self-driving cars. Waymo doesn’t ship a software update because it worked once in a test drive. They ship because the update passes millions of simulated miles, the regression suite hasn’t degraded, and the production fleet’s metrics confirm the improvement. The same pattern works for voice AI agents, and it’s the strategic difference between agents that ship and agents that don’t.

Our guide on voice AI agent evaluation covers the methodology in depth.

Compliance and regulatory considerations

For regulated industries (healthcare, financial services, government, regulated consumer verticals), compliance affects almost every architecture decision.

HIPAA. Healthcare deployments need STT, LLM, and TTS providers that will sign a business associate agreement (BAA). The major commercial vendors offer HIPAA BAAs; some smaller vendors don’t.

SOC 2. Standard for enterprise B2B deployments. Most major platforms in the voice AI ecosystem have SOC 2 Type 2 reports.

PCI. Required when the agent handles payment card data. Either avoid PCI scope by routing those interactions to dedicated systems, or build PCI-compliant infrastructure (a meaningfully larger engineering investment).

FedRAMP. Government deployments. Narrows the set of viable vendors to a short list.

State and regional laws. California call-recording consent laws, GDPR for European callers, Australian data residency requirements. The deployment surface determines which laws apply.

The pattern: get compliance requirements explicit early, treat them as architectural constraints rather than compliance theater, and choose components that already have the attestations you need. Adding compliance to a deployed voice agent after the fact is meaningfully harder than building with it in mind.

Cost economics

Voice AI agent unit economics depend on the architecture and volume. A rough framework for cost per minute of conversation in 2026:

STT. $0.005 to $0.012 per minute depending on provider and language.
LLM. Variable. $0.005 to $0.05 per minute for a typical voice agent conversation, depending on the model and the verbosity of the agent.
TTS. $0.01 to $0.05 per minute, with premium TTS at the higher end.
Telephony. $0.01 to $0.02 per minute for inbound, more for outbound.
Platform fees. $0.02 to $0.10 per minute on top, depending on the platform.

A typical production voice agent costs $0.05 to $0.30 per minute of call time. The economics improve at higher volume, and the biggest cost-optimization levers are in the LLM and STT layers.

The ROI calculation usually compares this to the all-in cost of a human agent (typically $1.00 to $2.00 per minute including overhead), which makes the math compelling at most call volumes. The harder question is the cost of getting the agent wrong: a misclassified emergency, a missed compliance disclosure, a customer churn event. Evaluation infrastructure is how teams keep that cost manageable.
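The per-minute arithmetic can be checked directly. The sketch below copies the component ranges and the human-agent range from the figures above; the 100,000 minutes per month in the example is an assumed volume. Summing the listed inbound components gives about $0.05 to $0.23 per minute, so the $0.30 upper bound quoted above leaves headroom for outbound telephony and costs not itemized in the list.

```python
# Per-minute component ranges in USD from the framework above: (low, high).
COMPONENT_COST_PER_MIN = {
    "stt": (0.005, 0.012),
    "llm": (0.005, 0.05),
    "tts": (0.01, 0.05),
    "telephony_inbound": (0.01, 0.02),
    "platform": (0.02, 0.10),
}
HUMAN_COST_PER_MIN = (1.00, 2.00)


def total_range(components: dict[str, tuple[float, float]]) -> tuple[float, float]:
    """Sum the low ends and the high ends of every component range."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return round(low, 3), round(high, 3)


def monthly_savings(minutes: int, agent_cost: float, human_cost: float) -> float:
    """Gross savings per month, before the cost of failures and evaluation."""
    return round(minutes * (human_cost - agent_cost), 2)


agent_low, agent_high = total_range(COMPONENT_COST_PER_MIN)  # (0.05, 0.232)

# Most conservative case: the most expensive agent against the cheapest human.
conservative = monthly_savings(100_000, 0.30, HUMAN_COST_PER_MIN[0])  # 70000.0
```

This figure is gross: it excludes the cost of getting the agent wrong, which is the part the paragraph above identifies as the harder question.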

Common pitfalls when deploying voice AI agents

The same patterns of failure repeat across teams.

Over-scoping the v1. Trying to build a general-purpose agent that handles the long tail of every possible call type. The agents that ship start narrow and expand from a working foundation.

Underestimating audio realism. Test sets that use clean studio audio recorded in quiet rooms will not predict production performance. Real phone audio with road noise, restaurant background, accent variation, and connection quality issues is essential to the test set.

Mocking the integrations. Mocked CRMs, EHRs, and payment processors are fine for unit tests, dangerous as the only evaluation. Most production bugs live at the integration layer.

No regression testing. Prompt and model changes silently regress behaviors that used to work. Automated regression testing is the difference between confident deployment and fearful deployment.

Treating evaluation as a launch gate. Evaluation is continuous. Teams that treat it as a one-time launch checklist ship agents that pass the gate and then degrade silently over time.

Skipping the escalation design. A voice agent that can’t gracefully hand off to a human when it should creates a worse customer experience than the IVR menu it replaced.

What good looks like

A voice AI agent operating reliably in production at a mid-to-large enterprise in 2026 typically looks like:

A focused use case, well-instrumented, handling thousands of calls per day.
A cascaded model stack with attested vendors, with the specific model choice empirically validated against the team’s own data.
A scenario library of several hundred test conversations, run automatically on every change.
Production observability surfacing behavioral, functional, and operational metrics in real time.
A clear escalation path to human agents for cases the agent shouldn’t or can’t handle.
A feedback loop that turns production incidents into new test scenarios within days.
Quarterly re-evaluation against alternative model choices and vendor updates.

The teams operating at this bar treat the agent as production infrastructure and apply the same discipline they’d bring to any other business-critical system. Chasing the latest model takes a back seat to maintaining a steady evaluation rhythm.

Where to go from here

Voice AI agents are now production infrastructure. The teams shipping them well have figured out that the architecture, platform, and model choices are tactical decisions. The strategic decision is whether to invest in evaluation infrastructure that makes the agent shippable, debuggable, and improvable over time.

If you’re earlier in the journey, our guide on how to evaluate voice agents is the practical next read. If you’re further along and want to talk through what evaluation looks like for your specific deployment, book a call with the Coval team. For real-time voice AI performance data across providers, see benchmarks.coval.ai.

Frequently asked questions

How long does it take to deploy a voice AI agent?

For a focused use case on a managed platform like Vapi or Retell, a working production agent takes 8 to 16 weeks: a few weeks to build the agent, several weeks to integrate with internal systems, and the remainder for evaluation, edge case handling, and compliance review. Teams with evaluation infrastructure in place from the start compress the back half (often to weeks rather than months).

Should I build my own voice AI stack or use a managed platform?

Most teams should use a managed platform. The build path requires real engineering investment that competes with the business logic of the agent itself. Custom builds make sense when volume is in the millions of minutes per month, customization needs are extreme, or compliance requirements are restrictive enough to rule out the available platforms.

What’s the most important factor in voice agent quality?

Evaluation infrastructure. The architecture, model, and platform choices matter, but they matter less than the team’s ability to measure agent quality at the speed and depth that production deployment requires. Teams with strong evaluation infrastructure can switch architectures, models, and platforms as conditions change. Teams without it are stuck with whatever decision they made first.

How do I know if my voice agent is ready for production?

It’s ready when the regression suite is stable across releases, the simulation results match what production data shows, the escalation path is reliable, the compliance review is complete, and the team is confident they can detect drift before customers do. Most teams ship before all of these are true, and find out which one mattered most after launch.

What’s the difference between voice AI agents and traditional IVR systems?

Traditional IVR systems follow deterministic menu trees: press 1 for billing, press 2 for support. Voice AI agents handle open conversation: callers can describe what they need in their own words, the agent decides what to do, and the agent can complete tasks the IVR would have handed off to a human. The user experience is meaningfully different; the operational requirements (evaluation, observability, compliance) are also meaningfully different.