Voice AI Agent Evaluation: The Complete Guide (2026)
Henry Finkelstein, Founding Growth Engineer
Last updated: April 2026
Reading time: 35 min
Author: Coval — simulation and evaluation infrastructure for voice AI, founded by Brooke Hopkins (ex-Waymo evaluation infrastructure)
Key Takeaways
Voice AI agent evaluation is the discipline of testing and monitoring voice agents so they work reliably in production, not just in demos.
Most teams plateau at ~85% performance because manual QA doesn't scale. Getting from 85% to 95% requires programmatic evaluation. Getting from 95% to 99% is the frontier.
Start simple: define what your agent must always do and must never do. Test with the easiest personas first. Expand from there.
Evaluation infrastructure isn't a one-time project. It's a continuous improvement loop that compounds over time, turning every production failure into a future test case.
The teams that win are the ones with the best evaluation infrastructure on day 100, not the best agents on day one.
Table of Contents
Why We Wrote This Guide
The Voice AI Agent Maturity Curve
Start By Listening: Why Manual QA Is Where Every Team Should Begin
Your First Automated Evals: Start With What Must Always (and Never) Happen
Broadening Personas, Deepening Coverage
CI/CD for Voice Agents: Treating Releases Like Software Releases
Building Trust in Evals with Human Review
Communicating Evaluation Strategy Across the Organization
From Dev to Prod: Why Production Is Never the Same
Expanding Into the Unknown Unknowns
The Long Horizon: Building an Evaluation Platform That Grows With You
Why We Wrote This Guide
Definition: Voice AI agent evaluation is the practice of testing, measuring, and monitoring voice-based AI agents so they perform reliably across real-world conditions, not just controlled demos.
Most voice AI teams are building blind.
They ship agents that sound great in demos, field-test them with a handful of friendly calls, and cross their fingers when the real traffic arrives. When something breaks (and it always breaks) they find out from angry customers, not from their evaluation infrastructure. Because most teams don't have evaluation infrastructure.
We wrote this guide because we believe the world is genuinely a better place when voice agents are effective, useful, and enjoyable to interact with. Not as a marketing slogan. As a conviction that shapes how we spend our time. Every caller who gets routed correctly, every patient who receives accurate information, every customer who doesn't have to repeat themselves three times: that's the outcome good evaluation produces.
We've worked alongside hundreds of teams building voice AI across healthcare, insurance, fintech, government, logistics, and quick-service restaurants. We've seen the same maturity curve play out over and over: the early excitement, the first production fire, the "oh, we actually need a testing strategy" realization, and, for the teams that push through, the compounding returns of disciplined evaluation.
This guide captures what we've learned from watching that movie hundreds of times. It's for anyone building or deploying voice agents, regardless of what tools you use. The principles are universal, even if the specifics of your stack are unique.
One more thing: before founding Coval, our CEO Brooke Hopkins built evaluation infrastructure at Waymo. The core insight she brought to voice AI is the same one that made autonomous vehicles viable: you don't get to production-grade reliability by driving more miles. You get there through simulation. Voice AI is converging on the same path. This guide is the map.
The Voice AI Agent Maturity Curve
Here's a framework we've developed from working with teams at every stage of voice AI deployment. We call it the maturity curve, and it maps the journey from "it works in a demo" to "it works at scale in production, every time."
The curve has four stages. Each one requires a different approach to evaluation.
| Stage | Performance | What It Looks Like | How Teams Get Here |
|---|---|---|---|
| Stage 1: Out of the Box | ~70% | Happy paths work. Demos feel magical. Edge cases are invisible because nobody's testing for them. | Pick an orchestration platform. Write some prompts. Ship it. |
| Stage 2: Manual Effort | ~85% | Prompt tuning, listening to calls, ad hoc fixes. The team can feel the quality improving, but every fix is manual and nothing is systematic. | Hire QA. Listen to calls. Tweak prompts. Repeat. |
| Stage 3: Programmatic Evals | ~95% | A diligent, iterative evaluation strategy. Automated test suites. Regression detection. The team has shifted from reactive firefighting to proactive quality engineering. | Build evaluation infrastructure. Instrument automated evals. Run them continuously. |
| Stage 4: The Frontier | ~99% | Production feedback loops. Cross-functional buy-in. Cost-optimized test orchestration. Every production failure becomes a future test case. | Mature eval platform, human review calibration, continuous improvement loops, cross-functional reporting. |
If you're at Stage 1, you're not behind. You're where every team starts. The platforms and models have gotten good enough that 70% is table stakes. You get that for showing up.
The jump from Stage 1 to Stage 2 is effort. You hire people, you listen to calls, you grind. Most teams can muscle their way to 85%.
The jump from Stage 2 to Stage 3 is the hard one. This is where manual effort hits a wall. You can't listen to 2,000 calls a week. You can't run 200 test calls before every deploy. You need evaluation infrastructure: programmatic, automated, repeatable. This guide is primarily about making that jump.
The jump from Stage 3 to Stage 4 is where art meets science. It's less about running more tests and more about running the right tests, at the right time, at the right cost, and getting the results in front of the right people. The teams operating at 99% have turned evaluation into a competitive advantage, not a cost center.
"Even if you're 99%, you're still getting 2,000 bad calls." — Founder & CEO, voice AI startup at scale
That quote lands differently depending on where you are on the curve. At 200 calls a month, 1% is 2 bad calls. Annoying but manageable. At 200,000 calls a month, it's 2,000 bad calls, and each one is a real person having a bad experience. Scale changes everything, and evaluation is how you keep quality from degrading as volume grows.
The rest of this guide walks you through each stage of the maturity curve, from the very beginning to the frontier. Start wherever you are. (If you're evaluating voice AI testing tools, you may also find our comparisons of Coval vs. Cekura and Coval vs. BlueJay helpful for understanding the vendor space.)
Start By Listening
Here's something that might surprise you coming from a company that builds evaluation infrastructure: in the beginning, manual QA is wise.
When your agent is handling a dozen calls a day, you absolutely can and should listen to every single one. Read every transcript. Note what worked, what didn't, what surprised you. Far from primitive, this is how you build the intuition that makes everything else possible.
Manual QA at low volume gives you three things that no automated system can:
Pattern recognition you can't codify yet. You'll hear failure modes you wouldn't have thought to test for. A caller who trails off mid-sentence. A background noise pattern that confuses the transcriber. An accent that shifts the agent's intent classification. These become the seeds of your future test suite.
Calibration of what "good" actually sounds like. Before you can write an eval rubric, you need to know what you're evaluating against. Listening to real calls builds a visceral sense of quality that no specification document can replace.
A reality check on your assumptions. The calls you imagined during development are not the calls you'll get in production. Not even close.
"Our current solution is not scaling. And by current solution it means manual QA testing and calling." — Voice AI Architect, healthcare tech company
That quote captures the transition point perfectly. Manual QA is the right tool until it isn't. The breaking point usually arrives faster than teams expect.
We recently spoke with an engineer at an AI recruiting platform who described the exact inflection point. His team had five people doing manual QA: "five people testing different machines, different scenarios, different scripts." When call volume was 20 per customer per month, that worked. Then a client asked for 6,000 calls a month. Last week alone, they processed 250 interviews across 10 languages, with call durations ranging from 5 to 60 minutes. The five-person QA team went from adequate to "unrealistic" overnight.
That's the signal. When you catch yourself thinking "we need more QA people" instead of "we need better QA tools," you've outgrown manual testing.
The transition doesn't have to be abrupt. The best teams don't flip a switch from manual to automated. They start by automating the most repetitive checks: did the agent introduce itself? Did it verify identity before routing? They keep manually reviewing the calls that matter most. Over time, the automated coverage expands and the manual review becomes targeted rather than exhaustive.
But don't rush it. If you're at a dozen calls a day, put your headphones on and listen. You'll learn more in an afternoon of listening than in a week of building test infrastructure.
Your First Automated Evals
You've listened to enough calls to know what matters. Now it's time to codify that intuition into your first automated evaluations. And the best place to start is deceptively simple.
The Must-Always / Must-Never Framework
Two lists. That's it.
What must your agent ALWAYS do?
Always verify caller identity before accessing account information
Always offer a path to a human agent when the caller requests one
Always confirm critical actions before executing them (transfers, cancellations, bookings)
Always follow the required disclosure sequence (HIPAA, TCPA, state-specific)
What must your agent NEVER do?
Never provide medical, legal, or financial advice
Never repeat back sensitive information (SSNs, credit card numbers, full account numbers)
Never continue a conversation after the caller has asked to stop
Never skip required compliance disclosures
Never make promises about outcomes it can't guarantee
These are your base cases. They're not sophisticated. They're not clever. They are the floor. The absolute minimum standard your agent must meet before anything else matters.
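To make the two lists executable, each rule can become a binary check over a transcript. The sketch below is a minimal Python illustration: the rule names and the keyword matching are assumptions for demonstration, and in practice each check would be an LLM-judged rubric (covered below) rather than string matching, but the shape of the suite is the same.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    kind: str                      # "must_always" or "must_never"
    check: Callable[[str], bool]   # returns True if the behavior occurred

# Hypothetical transcript-level checks. Real checks would be LLM-judged
# rubrics; string matching here is only a placeholder.
RULES = [
    Rule("verify_identity", "must_always",
         lambda t: "date of birth" in t.lower() or "last four" in t.lower()),
    Rule("offer_human_agent", "must_always",
         lambda t: "transfer you" in t.lower()),
    Rule("gives_medical_advice", "must_never",
         lambda t: "you should take" in t.lower()),
]

def evaluate(transcript: str) -> dict[str, bool]:
    """Pass/fail per rule: must-always rules pass when the behavior
    occurred; must-never rules pass when it did not."""
    results = {}
    for rule in RULES:
        occurred = rule.check(transcript)
        results[rule.name] = occurred if rule.kind == "must_always" else not occurred
    return results

if __name__ == "__main__":
    transcript = ("Agent: Can I get your date of birth to verify your identity? "
                  "Caller: Sure, it's March 3rd. Agent: Thanks, I can transfer you "
                  "to a nurse for medical questions.")
    print(evaluate(transcript))  # all True: the floor holds on this call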
Here's what this looks like across different verticals:
| Vertical | Must Always | Must Never |
|---|---|---|
| Healthcare | Verify patient identity; route emergencies immediately; deliver HIPAA disclosure | Give medical advice; confirm/deny diagnosis; share PHI with unverified callers |
| Fintech | Verify identity before account access; disclose call recording; follow mini-Miranda for collections | Provide investment advice; disclose account details to unverified parties; skip state-specific disclosures |
| Insurance | Capture claim details accurately; route to adjuster when required; confirm coverage scope | Guarantee claim outcomes; provide legal interpretation of policy language; skip fraud screening questions |
| QSR / Retail | Confirm order back to caller; apply correct pricing; handle dietary/allergy inquiries accurately | Charge without confirmation; ignore allergy-related questions; process orders for closed locations |
"This is not a toy. This is something necessary that otherwise we will have to build, and it will take us many months." — Staff ML/AI Engineer, major insurance carrier
Start With Easy Mode
Your first test personas should be the simplest possible callers. No interruptions. Even, neutral tone. No background noise. Clear enunciation. Standard American English. One intent per call. No curveballs.
Think of this as a driving test on an empty parking lot. If your agent can't pass easy mode, nothing else matters. It needs to nail the basics (greetings, identity verification, intent classification, task execution, sign-off) before you start adding complexity.
Key concept — LLM-as-judge: Instead of manually reviewing every call, you can use a large language model to evaluate conversations against your rubrics automatically. The LLM reads the transcript (or listens to the audio) and scores it against specific criteria. This is the foundation of scalable voice AI evaluation. See UC Berkeley's research on LLM-as-judge methodology for the technical underpinnings.
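As a concrete illustration of the pattern, here is a minimal LLM-as-judge sketch, assuming an OpenAI-style chat-completion client. The model name, rubric wording, and JSON contract are placeholder assumptions, not a prescribed setup; any chat-capable model works.

```python
import json
from openai import OpenAI  # any chat-completion client works; OpenAI shown as an example

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = """You are grading a voice-agent call transcript.
Criterion: did the agent verify the caller's identity before discussing
account details? Respond in JSON: {"pass": true or false, "evidence": "<quote>"}."""

def judge(transcript: str, model: str = "gpt-4o-mini") -> dict:
    """Score one transcript against one binary criterion."""
    response = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": transcript},
        ],
    )
    return json.loads(response.choices[0].message.content)
```

One criterion per judge call keeps failures attributable, which is exactly where the next section picks up.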
One Eval Dimension Per Metric
This is a mistake we see constantly: teams bundle multiple criteria into a single metric. "Knowledge base utilization" that checks both factual accuracy and response completeness. "Conversation quality" that blends tone, pacing, and resolution into one score.
When a bundled metric fails, you have no idea what failed. Was the information wrong? Was it right but incomplete? Was it complete but delivered in a robotic monotone?
Break it down. Each metric measures exactly one thing:
Task completion: Did the agent accomplish what the caller needed?
Factual accuracy: Was the information provided correct?
Compliance adherence: Were all required disclosures and procedures followed?
Conversation flow: Did the agent handle turn-taking, interruptions, and transitions naturally?
Escalation appropriateness: Did the agent hand off to a human when it should have, and not when it shouldn't have?
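In code, "one dimension per metric" simply means each metric is its own named rubric, scored in its own judge call. A sketch of that structure, with field names as illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Metric:
    name: str       # exactly one dimension per metric
    question: str   # the single question the judge answers
    scale: str      # "binary" or "1-5"

METRICS = [
    Metric("task_completion", "Did the agent accomplish what the caller needed?", "binary"),
    Metric("factual_accuracy", "Was every piece of information provided correct?", "binary"),
    Metric("compliance_adherence", "Were all required disclosures and procedures followed?", "binary"),
    Metric("conversation_flow", "Rate turn-taking, interruption handling, and transitions.", "1-5"),
    Metric("escalation_appropriateness", "Was the handoff decision correct in both directions?", "binary"),
]

def score_all(transcript: str, judge: Callable[[str, str, str], object]) -> dict[str, object]:
    # One judge call per metric, so a failure pinpoints exactly one dimension,
    # never a blended "conversation quality" score.
    return {m.name: judge(transcript, m.question, m.scale) for m in METRICS}
```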
For a deeper dive on which metrics matter most, see our breakdown of the 5 metrics that actually predict production success. To map specific metric definitions to Coval primitives, see the metrics documentation.
"If I can buy this, I don't need to build this." — VP Engineering, regulated healthcare platform
Run your must-always and must-never checks against your easy-mode personas. Get to 100% pass rate on these base cases before moving on. This is your regression floor, the foundation everything else builds on.
Broadening Personas, Deepening Coverage
Your agent passes easy mode. Every must-always fires. Every must-never holds. Congratulations. You've established the floor. Now it's time to find out where the ceiling is.
Broadening and deepening are two separate axes. The teams that get to 95% work both of them on purpose.
Broadening: Harder Callers
Real production traffic doesn't sound like your easy-mode personas. Real callers have accents, background noise, emotional states, and conversational habits that your clean-room tests will never replicate.
Here's what a persona progression looks like:
| Difficulty | Caller Characteristics | What It Tests |
|---|---|---|
| Easy | Clear speech, neutral tone, single intent, no background noise, standard dialect | Baseline functionality — does the agent work at all? |
| Medium | Slight accent, minor background noise (office, car), two intents in one call, occasional filler words | Real-world readiness — does the agent handle normal human speech? |
| Hard | Strong regional accent, significant background noise (restaurant, subway), emotional caller (frustrated, confused), mid-conversation topic pivots | Resilience — does the agent recover from real-world conditions? |
| Adversarial | Non-native speaker, loud environment (construction, crying child), angry caller post-IVR wait, contradictory instructions, attempts to social-engineer the agent | Stress limits — where exactly does the agent break? |
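One way to operationalize the ladder is to express personas as structured specs that a simulator consumes, and to gate progression so that new failures are coverage gaps rather than regressions. The schema below is a hypothetical sketch, not any platform's format:

```python
from dataclasses import dataclass, field

@dataclass
class Persona:
    difficulty: str
    accent: str
    background_noise: str   # noise profile mixed into the simulated caller audio
    emotional_state: str
    intents: list[str] = field(default_factory=list)

PERSONA_LADDER = [
    Persona("easy", "standard_american", "none", "neutral",
            ["check_balance"]),
    Persona("medium", "slight_regional", "office_hum", "neutral",
            ["check_balance", "update_address"]),
    Persona("hard", "strong_regional", "restaurant", "frustrated",
            ["dispute_charge"]),
    Persona("adversarial", "non_native", "construction_site", "angry",
            ["dispute_charge", "social_engineering_probe"]),
]

def next_tier(pass_rates: dict[str, float], threshold: float = 1.0) -> str | None:
    """Only unlock the next difficulty once the current tier passes cleanly."""
    for persona in PERSONA_LADDER:
        if pass_rates.get(persona.difficulty, 0.0) < threshold:
            return persona.difficulty
    return None

print(next_tier({"easy": 1.0, "medium": 0.92}))  # 'medium' still needs work
```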
"With the audio there's so many — I've got noise, I've got barking dogs, I've got kids, I've got accents that we're really kind of struggling to get better at." — Chief Engineer, Agentic AI Platforms, major healthcare services company
"Most voices that are being sold are modeled against people that speaks very well. But when you look at people that are actually speaking on the phone, they're going to have stuttering." — Staff ML/AI Engineer, major insurance carrier
These quotes point to the same blind spot: voice AI models are trained and tested on clean speech, but production traffic is messy. The evaluation gap between clean-room testing and real-world performance is where the most consequential failures hide. For a deeper look at how different voice AI testing tools handle this challenge, see our comparison of evaluation platforms.
Deepening: More Functionality Under Test
While broadening makes the callers harder, deepening expands what you're evaluating:
Multi-turn conversation flows: Does the agent maintain context across 5, 10, 15 turns? What happens when the caller circles back to a topic from earlier?
Tool call accuracy: When your agent talks to a POS, EHR, or backend API, does it pass the right parameters? One team discovered that their agent understood intent perfectly but was injecting malformed orders into the POS. Invisible without trace-level evaluation (see the sketch after this list).
Conversation pivots: "Actually, cancel that — I want to change my address instead." How does the agent handle mid-call intent shifts?
Escalation judgment: Does the agent escalate at the right moment, with the right context? Not too early, not too late?
Concurrency and load: Does the agent maintain quality at 10x or 100x your current call volume? Voice AI infrastructure breaks differently under load than text systems. See our guide to voice load testing for stress-test methodology.
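For the tool-call dimension specifically, trace-level checks are what catch the malformed-POS-order failure described above. A minimal sketch, assuming tool calls are logged as JSON-like records; the schema and field names are hypothetical:

```python
# Hypothetical schema: each tool call in a trace is logged as a dict with
# "tool" and "arguments" keys. Required parameters per tool are declared here.
REQUIRED_FIELDS = {
    "pos.create_order": {"location_id", "items", "total_cents"},
}

def validate_tool_calls(trace: list[dict]) -> list[str]:
    """Return human-readable violations found in one call's tool-call trace."""
    violations = []
    for call in trace:
        required = REQUIRED_FIELDS.get(call.get("tool", ""), set())
        missing = required - call.get("arguments", {}).keys()
        if missing:
            violations.append(f"{call['tool']}: missing {sorted(missing)}")
        if call.get("tool") == "pos.create_order":
            if call.get("arguments", {}).get("total_cents", -1) < 0:
                violations.append("pos.create_order: absent or negative total")
    return violations

trace = [{"tool": "pos.create_order",
          "arguments": {"location_id": "store_42", "items": ["burger"]}}]
print(validate_tool_calls(trace))
# ["pos.create_order: missing ['total_cents']",
#  'pos.create_order: absent or negative total']
```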
The Whack-a-Mole Trap
Teams that broaden without a regression foundation play whack-a-mole.
"It feels like we're constantly playing a game of Whack-a-Mole... we solve one thing and then some other edge case comes up." — Founder & CEO, voice AI startup at scale
The fix is disciplined sequencing. Lock your base cases first. Get your easy-mode regression suite to 100%. Then add medium-difficulty personas and run the full suite. If the new personas break something that easy mode caught, you've found a genuine regression. If they break something new, you've found a coverage gap. Add it to the suite and keep going.
One more thing to watch for: the LLM people-pleaser problem. If you're using LLMs to simulate test callers, they will bend over backwards to make the conversation succeed. They don't stammer. They don't get confused. They don't say "uh, wait, no, the other thing." They're cooperative to a fault.
"It will do everything in its power to make sure it gets through a conversation." — Forward-deployed engineer, enterprise voice AI platform
Real callers are not cooperative. They're confused, frustrated, distracted, and sometimes actively adversarial. Your test personas need to reflect that, or your evaluations are testing a reality that doesn't exist.
CI/CD for Voice Agents
Your evaluation suite is growing. You've got base cases, medium-difficulty personas, and expanding coverage. The question shifts from "do we have evals?" to "when do they run?"
The answer should be: automatically, on every change that could affect agent behavior.
Key concept — CI/CD for voice agents: Continuous Integration / Continuous Deployment applied to voice AI means automated evaluation suites that trigger on every code change, prompt update, or model swap, blocking releases that don't meet quality thresholds. This is the same principle that Martin Fowler's CI/CD methodology established for software, adapted for probabilistic voice systems.
Think of voice agent evaluation the way a software engineering team thinks about CI/CD. Every deploy, every prompt change, every model update, every knowledge base revision triggers a test suite. If the suite fails, the change doesn't ship.
This isn't theoretical. If you've ever shipped a "small prompt tweak" on a Friday afternoon, you know the feeling in your stomach. One team changed their agent's greeting from "How can I help you?" to "What can I help you with?", a change that seemed purely cosmetic. Booking completion dropped 12%, and they didn't catch it for two weeks.
"I dread the moment we have to change the workflow." — Staff ML/AI Engineer, major insurance carrier
That dread is rational. In traditional software, a bad deploy breaks a feature and you roll back. In voice AI, a bad deploy means real people have bad experiences on live calls. You might not even know it's happening until the complaint volume spikes.
Structuring Test Suites by Cadence
| Cadence | What Runs | Purpose |
|---|---|---|
| Per-deploy | Core regression suite (must-always, must-never, easy-mode personas) | Block bad releases. This is the gate. |
| Nightly | Extended regression + medium-difficulty personas | Catch regressions that slip through fast checks |
| Weekly | Full suite including hard/adversarial personas + edge cases | Comprehensive coverage sweep |
| On-demand | Targeted tests for specific changes (new prompt, new tool, new model) | Validate specific hypotheses |
The per-deploy suite should be fast. Minutes, not hours. It tests the foundation: does the agent still do what it must always do? Does it avoid what it must never do? Does it handle the happy paths?
The nightly and weekly suites can be thorough. They're not blocking a deploy. They're building confidence and expanding coverage.
Agent-native tooling matters here. The teams that actually run evals in CI/CD are the ones whose eval tools meet them where they work. That means CLI interfaces, API access, integration with existing development workflows, and compatibility with the tools engineers already use: coding assistants, MCPs, terminal-native workflows.
Here's the litmus test: if running an eval requires logging into a web dashboard, clicking through a wizard, and waiting for results in a separate tab, it won't get run on every deploy. It'll get run when someone remembers, which is a polite way of saying it won't get run. The eval platform needs to be as ergonomic as the development environment itself, which is why agent-native CLI workflows are becoming the standard for evaluation tooling.
The direction this is heading: evaluation as a first-class participant in the development loop, not an afterthought bolted on at release time. The same way linters and type checkers became invisible infrastructure that just runs, voice AI evals are heading there too.
Threshold gates: Define explicit pass/fail criteria. Not "the results look okay." Quantitative thresholds. Task completion rate above 95%. Compliance adherence at 100%. Average response latency under 500ms. If any threshold is breached, the deploy is blocked and the team investigates.
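A threshold gate is only a few lines in practice. Here's a sketch, assuming your eval run emits a summary dict of aggregate metrics (a hypothetical shape); the nonzero exit code is what CI systems key on to block the deploy, and the threshold values mirror the ones above:

```python
import sys

# Quantitative gates from the paragraph above; tune the floor to your agent.
THRESHOLDS = {
    "task_completion_rate": 0.95,    # minimum
    "compliance_pass_rate": 1.00,    # minimum
    "avg_response_latency_ms": 500,  # maximum
}

def gate(summary: dict[str, float]) -> int:
    """Return 0 (ship) or 1 (block); CI treats a nonzero exit as failure."""
    failures = []
    if summary["task_completion_rate"] < THRESHOLDS["task_completion_rate"]:
        failures.append("task completion below 95%")
    if summary["compliance_pass_rate"] < THRESHOLDS["compliance_pass_rate"]:
        failures.append("compliance below 100%")
    if summary["avg_response_latency_ms"] > THRESHOLDS["avg_response_latency_ms"]:
        failures.append("average latency above 500ms")
    for failure in failures:
        print(f"BLOCKED: {failure}")
    return 1 if failures else 0

if __name__ == "__main__":
    # In CI, this summary would be read from the eval run's output artifact.
    sys.exit(gate({"task_completion_rate": 0.97,
                   "compliance_pass_rate": 1.00,
                   "avg_response_latency_ms": 430}))
```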
If you're moving from legacy IVR to LLM-driven voice agents, the same principle applies — the test suite is what makes the migration safe. See our guide to automated IVR testing for the regression patterns that catch the most common migration regressions. To wire eval gates directly into your existing pipeline, the GitHub Actions integration is usually a 30-minute setup.
Building Trust in Evals with Human Review
You've built automated evals. They run on every deploy. The dashboard shows green. But before you start sharing these results with your VP of Product, your compliance team, and your board, there's a critical step most teams skip.
You need to prove the evals are right.
We learned this one the hard way. LLM-as-judge evaluation scales in ways human review never can. But LLM judges have failure modes: they can be overly generous, miss subtle tone issues, or apply rubrics inconsistently across edge cases. If you share eval results that don't match what a human would conclude, you've burned trust. Rebuilding trust in evaluation infrastructure is harder than building it the first time.
The calibration workflow looks like this:
Run your automated evals on a batch of calls (50-100 is a good starting sample).
Sample a subset for human review. Have a human evaluator score the same calls against the same rubrics.
Compare the scores. Where do the LLM judge and the human agree? Where do they diverge? Are the disagreements random or systematic?
Iterate on the rubrics. If the LLM judge is consistently too lenient on tone, tighten the rubric. If it's flagging false positives on compliance, add edge case examples to the evaluation prompt. If it's missing a failure mode entirely, add a new metric.
Repeat until alignment is high. "High" means you trust the automated results enough to act on them without double-checking every one.
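For step 3, agreement is worth quantifying rather than eyeballing. A raw agreement rate works for binary metrics, and Cohen's kappa corrects for chance agreement when one label dominates. A self-contained sketch:

```python
def agreement_rate(llm: list[bool], human: list[bool]) -> float:
    """Raw agreement on a binary metric."""
    assert len(llm) == len(human) and llm
    return sum(a == b for a, b in zip(llm, human)) / len(llm)

def cohens_kappa(llm: list[bool], human: list[bool]) -> float:
    """Chance-corrected agreement; useful when most calls pass anyway."""
    n = len(llm)
    observed = agreement_rate(llm, human)
    p_llm, p_human = sum(llm) / n, sum(human) / n
    expected = p_llm * p_human + (1 - p_llm) * (1 - p_human)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

llm_scores   = [True, True, False, True, True, False, True, True]
human_scores = [True, False, False, True, True, False, True, True]
print(f"agreement = {agreement_rate(llm_scores, human_scores):.2f}")  # 0.88
print(f"kappa     = {cohens_kappa(llm_scores, human_scores):.2f}")
```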
"They want to see quality test cases right out of jump. They've seen a platform where it generates a bunch of test cases and maybe half of them are garbage." — Voice AI Architect, healthcare tech company
That quote is about test case generation, but the principle applies equally to eval results. If your team's first experience with automated evaluation is a batch of results that don't match reality, the entire system loses credibility. First impressions matter.
Practical tips for the calibration phase:
Start with binary metrics. Did the agent verify identity? Yes or no. Did it skip the disclosure? Yes or no. Binary metrics are easiest to calibrate because disagreements are obvious.
Graduate to scalar metrics carefully. "Conversation quality on a 1-5 scale" is where LLM-human disagreement gets messy. Define explicit rubrics for each score level with concrete examples.
Use disagreements as training data. Every case where the LLM and human disagree is a rubric improvement opportunity. These edge cases are gold.
Don't expect 100% agreement. Even two human reviewers won't agree 100% of the time. Aim for >85% agreement on binary metrics and within ±1 point on scalar metrics.
The payoff: Once your team trusts the automated evaluations, they become the shared language for quality. Product can reference eval scores in sprint planning. Engineering can set threshold gates with confidence. Compliance can run audits against eval data instead of manually reviewing call recordings. But none of that works if the eval accuracy isn't proven first. (For workflow specifics on how to run a calibration cycle in Coval, see human review.)
Communicating Evaluation Strategy Across the Organization
Once you trust your evals, the next challenge is getting the rest of the organization to trust them too.
Voice AI programs are cross-functional by nature. The agent touches Product, Engineering, QA, Compliance, Customer Success, and Executive Leadership. Each cares about evaluation differently. The most common failure mode we see isn't "the team didn't build evaluation infrastructure." It's "the team built it but only engineering uses it." Product never sees which intents are failing. Customer Success never sees which accounts are degrading. Compliance never sees the red team results. The infrastructure exists; the organization doesn't benefit from it.
Here's the stakeholder map:
| Stakeholder | What They Care About | How to Message Eval Value | The Metric on Their Dashboard |
|---|---|---|---|
| Product | User experience, feature adoption, roadmap prioritization | "Evaluation tells you which conversation flows to fix first — ranked by user impact, not engineering intuition." | Resolution rate by intent, conversation drop-off points |
| Engineering | System reliability, deploy confidence, debugging speed | "Evaluation is your CI/CD for voice — regression tests block bad deploys, observability cuts debugging from days to hours." | Test pass rate per deploy, latency percentiles |
| QA | Test coverage, defect detection, release readiness | "Automated evaluation replaces the 200-call manual sprint with continuous coverage that scales to every accent, every edge case, every deploy." | Scenario coverage %, defect escape rate |
| Compliance | Regulatory exposure, audit readiness, incident liability | "Evaluation proves your agent handles disclosures, PII, and workflow ordering correctly before regulators or plaintiffs ask." | Compliance pass rate by regulation, disclosure timing accuracy |
| Customer Success | Account health, expansion signals, churn prevention | "Evaluation data shows which accounts have degrading agent quality before customers complain." | Per-account resolution trends, quality score trajectories |
| Executive Leadership | ROI, strategic risk, competitive positioning | "Evaluation turns voice AI from a bet into a measured investment — here's what it costs, what it saves, and where the opportunities are." | Cost per resolution vs. human baseline, quality trajectory |
"Without FedRAMP I would have been in production six months ago." — VP Engineering, regulated healthcare platform
That's a compliance stakeholder's pain, expressed in engineering terms. The evaluation team that surfaces "we've passed 100% of HIPAA disclosure checks across 10,000 simulated calls" to the compliance team, in their language, at their cadence, is the team that gets budget for year two. (For more on how compliance requirements shape voice AI evaluation, see NIST's AI Risk Management Framework.)
The ownership question: Who owns the evaluation suite? In the most effective teams we've seen, it's a single evaluation owner (usually a senior engineer or technical PM) who maintains the test suite and distributes results to each stakeholder in their language. Without a single owner, evaluation becomes everybody's responsibility and nobody's priority.
Reporting cadence:
Weekly for Engineering: deploy-level results, regression alerts, new failure modes discovered
Monthly for Product and Customer Success: trend-level analysis, intent-level performance, account health signals
Quarterly for Compliance and Legal: regulation-level audit results, red team findings, incident exposure analysis
Bubbling up to the C-suite: Senior leaders don't need test pass rates. Sending them a 20-page test report is a great way to ensure they never read anything you send again. They need three things: (1) is this investment working, (2) what's the risk if we don't invest more, and (3) what strategic opportunities is the data revealing. One slide. Resolution rate trends, cost per resolution vs. human agents, quality trajectory month-over-month. That's the whole conversation.
Where this is heading: the teams that get cross-functional buy-in on evaluation are changing how their organization makes decisions about voice AI. Evaluation data becomes the shared ground truth that replaces anecdotes, gut feelings, and "it sounded fine to me" in planning meetings.
From Dev to Prod
Here's the uncomfortable truth about production: it is never the same as dev. Not close. Not approximately. Not "mostly the same with a few edge cases." Production is a different animal entirely, and the evaluation strategy that gives you confidence in staging will mislead you if you assume it transfers unchanged.
The pattern that keeps us up at night: agents that work flawlessly in controlled demos routinely fail when they hit real production traffic. As of early 2026, we've seen teams across healthcare, fintech, and QSR report first-week production success rates 20-30 percentage points below their pre-launch test results. That gap shows the distance between controlled conditions and reality.
In dev, your test callers speak clearly. In production, they're on speakerphone in a moving car with the radio on. In dev, your callers have one intent. In production, they open with "I need to change my address, also check my balance, oh and can you tell me why I got charged twice last month?" In dev, the network is stable. In production, 4G connections drop packets and add latency. In dev, callers are neutral. In production, they've just waited through a 12-minute IVR menu and they're furious before your agent says hello.
"Until shit hits the ceiling, people don't realize that observability is important." — Product Leader, voice AI platform vendor
Run the Same Eval Suite on Production Trace Logs
Not a different, lighter eval. Not monitoring dashboards. The same rubrics, the same metrics, the same pass/fail criteria you use in dev, applied to real production conversations.
You don't need to evaluate every call. Start with a subsample:
10-20% of all calls get full eval suite coverage
100% of calls get lightweight anomaly detection (latency spikes, early hang-ups, escalation triggers)
Flagged calls (anomalies, complaints, escalations) get full eval + human review
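The tiering above reduces to a small routing function. A sketch, assuming each call record carries anomaly/complaint/escalation flags (the field names and the 15% sample rate are illustrative assumptions):

```python
import random

def coverage_tier(call: dict, full_eval_fraction: float = 0.15) -> str:
    """Assign a production call to an evaluation tier.

    Flagged calls always get the full suite plus human review; a random
    10-20% sample gets full suite coverage; everything else gets the
    lightweight anomaly checks that run on 100% of traffic.
    """
    if call.get("anomaly") or call.get("complaint") or call.get("escalated"):
        return "full_eval_plus_human_review"
    if random.random() < full_eval_fraction:
        return "full_eval"
    return "lightweight_anomaly_detection"

calls = [{"id": 1}, {"id": 2, "escalated": True}, {"id": 3, "anomaly": True}]
for call in calls:
    print(call["id"], coverage_tier(call))
```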
Key concept — Voice observability: The ability to understand what's happening inside your voice agent in production — not whether calls completed, but how the conversation flowed, where it struggled, and why. It includes full conversation traces, audio analysis, and component-level performance attribution. For a primer on what voice observability covers and why text-based observability tools fall short, see our explainer on voice AI observability.
The Four Pillars of Voice Observability
Conversation content — Full transcripts and original audio. Transcripts alone miss critical information. A transcript that reads "yes, I'd like to cancel" sounds very different when the caller is crying vs. angry vs. matter-of-fact.
Context signals — Who called, from where, what time, what device, what was their history? A caller's third attempt to resolve the same issue should be handled very differently from a first-time call.
Outcome data — Did the task get completed? Was it resolved on first contact? Did the caller call back within 24 hours? Resolution rate beats containment rate as a north star metric. Containment measures deflection, not success.
Performance metrics — Turn-by-turn latency, component breakdowns (ASR time, LLM time, TTS time), confidence scores. When a call goes bad, these traces tell you where in the stack it went bad: ASR misheard, LLM misclassified, TTS garbled, or the orchestration layer dropped context. (For real-time voice AI performance data across providers, see Coval's voice AI benchmarks.)
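Component-level attribution is simple arithmetic once traces record per-stage timings. A sketch with a hypothetical trace shape; the 500ms budget is an example, not a standard:

```python
from dataclasses import dataclass

@dataclass
class TurnTrace:
    turn: int
    asr_ms: int   # speech-to-text time
    llm_ms: int   # model inference time
    tts_ms: int   # text-to-speech time

def slowest_component(trace: list[TurnTrace], budget_ms: int = 500) -> list[str]:
    """For each over-budget turn, name the stage that dominated latency."""
    findings = []
    for t in trace:
        total = t.asr_ms + t.llm_ms + t.tts_ms
        if total > budget_ms:
            stage = max(("asr", t.asr_ms), ("llm", t.llm_ms), ("tts", t.tts_ms),
                        key=lambda pair: pair[1])[0]
            findings.append(f"turn {t.turn}: {total}ms total, {stage} dominated")
    return findings

print(slowest_component([TurnTrace(1, 120, 210, 90), TurnTrace(2, 110, 620, 95)]))
# ['turn 2: 825ms total, llm dominated']
```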
The production eval data should feed directly back into your development cycle. Calls that fail in production but would have passed in dev are the most valuable signals you'll ever get. They reveal the gap between your test environment and reality.
Expanding Into the Unknown Unknowns
Your regression suite covers the known scenarios. Your adversarial personas test the hard cases. Your production monitoring catches failures in real time. But there's a category of failure that none of these fully address: the things you never thought to test for.
"It's not a matter of if, but just when something shuts down some key workflow." — Head of CX Technology, consumer health and wearables company
"We're not going to have visibility to all the different variations." — Staff ML/AI Engineer, major insurance carrier
That engineer is right. You won't. No matter how thorough your test suite, production will surprise you. A caller who speaks two languages in the same sentence. A background noise pattern that's indistinguishable from speech. A conversation flow that technically completes but leaves the caller confused and unsatisfied. A prompt injection attempt you never considered.
The question isn't whether unknown unknowns will surface. It's whether you have the infrastructure to capture them, learn from them, and prevent them from happening again.
The Simulation-Monitoring Feedback Loop
This is the concept that Waymo pioneered for autonomous vehicles and that the most mature voice AI teams are now adopting. It works like this:
Capture: Production monitoring flags calls that deviate from expected patterns — unusual failure modes, edge cases, anomalous behavior.
Route: Flagged calls get automatically routed to human review queues for annotation. A human evaluator examines the call and labels what went wrong.
Analyze: The evaluation team looks for patterns across flagged calls. Are the same failure modes recurring? Is there a cluster of issues around a specific intent, accent, or time of day?
Generate: The identified patterns become new test cases in the development environment. The unknown unknown becomes a known test scenario.
Prevent: The new test cases get added to regression suites so the same failure mode gets caught before it reaches production again.
Repeat: The loop runs continuously. Every production failure makes the test suite stronger.
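The Generate step is often the simplest part of the loop to automate: an annotated production failure becomes a persona-plus-scenario record the regression suite can replay. A sketch with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass
class FlaggedCall:
    call_id: str
    annotation: str        # what the human reviewer said went wrong
    caller_profile: dict   # accent, noise, emotional state observed on the call
    failing_intent: str

@dataclass
class TestCase:
    name: str
    persona: dict
    intent: str
    expected: str

def to_test_case(flagged: FlaggedCall) -> TestCase:
    """Generate: turn an unknown unknown into a known regression test."""
    return TestCase(
        name=f"regression_{flagged.call_id}",
        persona=flagged.caller_profile,  # replay the real-world conditions
        intent=flagged.failing_intent,
        expected=f"agent handles: {flagged.annotation}",
    )

flagged = FlaggedCall("c_8841", "agent lost context after mid-call pivot",
                      {"accent": "non_native", "noise": "traffic"}, "update_address")
print(to_test_case(flagged).name)  # regression_c_8841
```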
This is the evaluation gap closing in real time. Static voice AI systems degrade over time. Prompts that worked six months ago stop working because models update, user behavior shifts, and edge cases accumulate. Teams without feedback loops discover this slowly — through rising complaint volumes and declining CSAT scores. Teams with production-derived testing feeding back into development see the opposite: each month's production data makes next month's test suite stronger, and quality compounds instead of eroding.
Think of it like the Swiss cheese model from aviation safety. Each evaluation layer has holes (we've written about this stack-up in detail in our three-layer testing framework post):
Regression testing catches known functional issues — but only the ones you've thought of
Adversarial testing discovers environmental issues — but simulated adversity isn't real adversity
Compliance testing validates regulatory adherence — but assumes good-faith callers
Production monitoring catches everything else — but only after it's already happened once
The implementation primitive that holds this together is simulations: scripted personas, scenarios, and conditions that turn each layer of the stack into runnable tests.
Stack these layers and the holes stop aligning. A failure that slips through regression testing gets caught by production monitoring. The production monitoring flags it, routes it to human review, and it becomes a regression test. The loop closes.
The practical starting point: You don't need to build the full feedback loop on day one. Start with one thing: a human review queue for anomalous production calls. Route any call that triggers an anomaly flag — early hang-up, escalation, low confidence score, repeat caller — to a queue where a human can review it. Even reviewing 20-30 flagged calls a week will surface patterns that no amount of pre-production testing would have caught.
The Long Horizon
Let's zoom out.
You've built your base case evals. You've broadened into hard personas and deepened into edge cases. You've integrated evals into your CI/CD pipeline. You've calibrated your automated judges with human review. You've communicated the results across the organization. You're running evals on production traffic and feeding failures back into your test suite.
Now what?
The honest answer: you do it all again. For the next model. For the next feature. For the next market. For the next threat vector.
The 70% → 85% → 95% → 99% journey is not a one-time climb. Every new model version, every new prompt strategy, every new tool integration, every new language or vertical you expand into — each one resets part of the curve. Your evaluation infrastructure is what makes each subsequent climb shorter, faster, and less painful than the last.
The cost problem is real. It's too expensive to test all the things every time. An agent with 74 scenarios, 10 language variants, 4 difficulty levels, and 3 model versions represents 8,880 test combinations. Running all of them on every deploy would take hours and burn through your evaluation budget in a week. (For the inverse cost — what one production failure actually costs when you skip evaluation — see our breakdown of the $500K cost of skipping evaluation infrastructure.)
The art — and it is an art — is knowing which tests to trigger at which point in the development cycle:
Base case regression (always run, every deploy): The must-always, must-never, easy-mode suite. Fast, cheap, non-negotiable.
Current hills (active focus, nightly/weekly): The specific areas you're actively improving — a new language, a new intent, a tricky edge case. These get intensive coverage now and graduate to regression once solved.
Graduated tests (solved, in stable regression): Hills you've already climbed. They stay in the regression suite at reduced frequency — enough to catch regressions, not so much that they dominate the test budget.
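This tiering maps directly onto a scheduler that decides which suites run on a given trigger within a test-call budget. The suite names, costs, and cadences below are illustrative assumptions; the point is the budget math, not the numbers:

```python
# Which suites run on which trigger, mirroring the three tiers above.
SCHEDULE = {
    "deploy":  ["base_regression"],
    "nightly": ["base_regression", "current_hills"],
    "weekly":  ["base_regression", "current_hills", "graduated"],
}

# Hypothetical suite sizes in simulated calls.
SUITE_COST = {"base_regression": 40, "current_hills": 300, "graduated": 1200}

def plan(trigger: str, budget: int) -> list[str]:
    """Select suites for a trigger without exceeding the test-call budget."""
    selected, spent = [], 0
    for suite in SCHEDULE[trigger]:
        if spent + SUITE_COST[suite] <= budget:
            selected.append(suite)
            spent += SUITE_COST[suite]
    return selected

print(plan("deploy", budget=100))   # ['base_regression']
print(plan("weekly", budget=2000))  # all three tiers fit
```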
This pattern recognition gets better at scale. When you've run millions of simulations across hundreds of agents in a dozen verticals, patterns emerge that no single team could see on their own. Certain failure modes cluster by vertical. Certain model updates cause predictable regressions. Certain conversation patterns are universally fragile regardless of use case. A platform that has seen the full picture can guide you to the tests that matter most for your specific situation — not because it has a magic algorithm, but because it has the data.
A fully extensible evaluation platform is ready for what comes next: new models that change behavior in subtle ways, new tools and integrations that add surface area, new compliance requirements that change what "correct" means, new threat vectors as bad actors get more creative, and new capabilities we haven't even imagined yet. The way to be ready is not to predict the future — it's to build learning systems that get better over time. Evaluation infrastructure that adapts as fast as the agents it's testing.
The bottom line:
It's possible to get to 70% performance out of the box with simple orchestration platforms. Getting to 85% is achievable with manual effort and dedication. Getting from 85% to 95% requires a diligent and iterative programmatic evaluation strategy. And getting from 95% to 99% is the hallowed ground of the most elite teams at the frontier of voice agent development.
Wherever you are on that curve, the important thing is that you're thinking about it. The teams that win aren't the ones that started with the best agent — they're the ones that built the best evaluation infrastructure. Because agents improve when you can measure them, and measurement compounds when you do it continuously.
If you want to talk about where you are on the maturity curve, we're here.
Frequently Asked Questions
How long does it take to implement evaluation infrastructure?
For a basic regression suite covering must-always and must-never checks, expect 1-2 weeks once you have a defined target agent and a small set of test scenarios. Reaching the 95% Stage 3 maturity (programmatic evals integrated into CI/CD with calibrated LLM-as-judge metrics) typically takes 2-3 months of disciplined work. Stage 4 — production feedback loops, cross-functional reporting, cost-optimized test orchestration — is a 6-12 month journey that compounds over time. The teams that move fastest don't try to build everything at once. They lock the regression floor first, then layer in adversarial coverage, then production-derived testing.
What's the minimum viable evaluation setup?
Two lists. What must your agent always do? What must it never do? Encode each as a binary metric, run it against 10-20 easy-mode personas (clear speech, neutral tone, single intent), and require 100% pass rate before you ship anything. That's it. Everything else — multi-difficulty personas, CI/CD gates, production monitoring — is layered on top of this foundation. Teams that try to skip this step and start with sophisticated metrics usually find themselves debugging the metrics instead of the agent.
How do I know if my evaluation suite is comprehensive enough?
Three signals. First: when you ship a prompt change, do you find out about regressions from your eval suite or from production complaints? If the answer is "production complaints," your suite has gaps. Second: do you have coverage for the conditions your real callers actually face — accents, background noise, multi-intent requests, frustrated emotional states? If your suite only tests clean-room conditions, you're not testing production. Third: when production surfaces a new failure mode, does it become a permanent test case within a week? If failures don't feed back into the suite, you're playing whack-a-mole rather than actually learning.
What's the ROI of voice AI evaluation?
The honest answer: the ROI is mostly defensive. Evaluation infrastructure prevents incidents that would otherwise cost six or seven figures in engineering time, customer remediation, regulatory exposure, and brand damage. A team running 200,000 calls per month at a 1% failure rate is producing 2,000 bad customer experiences every month. Cutting that failure rate in half is worth more than most teams realize. The offensive ROI shows up later: teams with mature evaluation can ship faster, expand into new languages and verticals with confidence, and turn quality data into product-roadmap insights that drive revenue. But the defensive case alone usually justifies the investment.
Should I build or buy evaluation tooling?
Build the evaluation rubrics yourself. They encode your domain expertise and your specific quality bar — nobody else can write them for you. Buy the infrastructure that runs them: simulation, observability, CI/CD integration, human review queues, cross-functional reporting. The rubrics are where your competitive advantage lives. The infrastructure is plumbing — building it from scratch typically takes 6-12 engineering-months and produces something less capable than off-the-shelf tooling. As one engineering leader put it: "If I can buy this, I don't need to build this."
Want to put these ideas into practice? Visit coval.ai for the tooling, or benchmarks.coval.ai for real-time voice AI performance data across providers.
See how Coval can help you improve your agents.
Book a call
