- Accuracy
- Latency
- Compliance
- Experience
Voice Agent Vendor Testing: How to Run a Bake-Off
How to run a voice AI vendor bake-off that picks the vendor your callers actually experience: the failure modes, the four shapes, and a defensible seven-step method.
Three vendors · one shared test suite · one decision
Voice agent vendor testing is the process of evaluating competing voice AI platforms against the same callers, the same scenarios, and the same scoring rubric. The goal is to find the vendor that creates the best experience for your customers, not the vendor with the best demo. Most teams skip this. They pick a vendor off a curated 20-minute demo, sign, and discover in production that the agent they bought is not the agent they saw. The stakes are real: a voice agent answers your phones, takes your orders, verifies your callers, and speaks in your brand voice to thousands of people you never hear from. Choosing the wrong one is expensive to unwind and slow to detect.
- A vendor bake-off compares platforms under identical conditions: the same simulated customers, the same test suite, the same metrics, the same iteration count. Anything less is a demo, not a test.
- Three failure modes pick the wrong vendor: demo theater, subjectivity winning over data, and the invisible long tail of conversations nobody can review by hand, including the adversarial calls a red-team test suite would have caught.
- The most reliable test suite is not imagined. It is mined from your own production or contact-center calls, then turned into a frequency-weighted scenario and persona matrix no vendor can pre-game.
- No single vendor wins everything. The value of a bake-off is seeing the tradeoffs clearly, then deciding against criteria you weighted before you heard a single pitch.
- An independent evaluation platform, like Coval, is the neutral measurement layer. It future-proofs you with vendor- and model-agnostic evaluations, and the test suite you build rolls forward into production monitoring once your chosen vendor is live. It is the instrument, not a contestant.
This guide is written for the person who owns the decision: the head of product, eval lead, or engineering leader who will sponsor or run a voice AI vendor evaluation. It draws on how rigorous bake-offs actually run across contact centers, restaurants, fintech, healthcare, and logistics, and on what Coval has seen from its position between buyers and vendors as the neutral measurement layer. The method is vendor-agnostic. The point is to make the comparison fair, fast, and defensible enough that the decision survives a skeptical room.
What a voice agent vendor bake-off actually is
Two concepts run through the whole method, so it helps to define them before anything else. Simulated customers are the callers you generate to test each vendor: synthetic people who behave the way your real callers do. In Coval’s model these are personas, and personas are the how of a simulated customer, the behavior pattern they bring to the call: their accents, languages, interruption habits, emotional state, speaking tempo, and environment (a drive-thru lane, a quiet office, a car in traffic). The test set is the what: what each simulated customer is trying to accomplish on the call (“call to dispute a charge on order 12345”). Keeping the two separate lets you run one scenario across many caller types. The third ingredient is the metrics: how each call is graded and evaluated once it runs. Together, the simulated customers (the how), the test sets (the what), and the metrics (the grade) make up the shared test suite that every vendor runs. That same test suite is reusable: the one you build to pick a vendor is the one you roll forward into production monitoring once the vendor is live, so the work compounds instead of being thrown away. More on that below.
The framing that helps most: you would never pick a self-driving stack off a ten-minute ride. Waymo earned trust by running its software across millions of simulated miles covering rare and dangerous edge cases before a car ever carried a passenger on those routes, not by giving polished demos. According to Waymo’s writing on its simulation infrastructure, the company ships an update because it has passed exhaustive simulation, not because it worked once on a test drive. Voice AI is the same problem in a different medium. A voice agent meets the same long tail of accents, interruptions, background noise, and edge-case requests that a self-driving car meets in traffic. Evaluating a vendor off a demo is the ride-along. A bake-off is the simulation.
The difference matters because the demo and the deployment are rarely the same agent. Coval’s own observation across voice AI teams is blunt: roughly 95 percent of agents handle their demo flawlessly, and only about 62 percent survive the first week in production. That gap is Coval’s internal observation from working with voice teams, not a market statistic, and it is the entire reason a bake-off exists. The demo measures the base case under ideal conditions. Production measures the long tail under real ones. A vendor bake-off is how you move the measurement from the first to the second before you sign.
Why most vendor bake-offs pick the wrong vendor
Three failure modes show up again and again. Each one is a different way of measuring the demo instead of the deployment.
The first failure mode is demo theater. Vendors demo on cherry-picked base cases. The script is clean, the caller is cooperative, the audio is studio-quality, and the request is one the agent was tuned to handle. The buyer grades on impression: did the conversation feel smooth, did the voice sound good, did it do the thing on stage. None of that predicts what happens when a caller with a thick accent interrupts mid-sentence from a moving car.
Demo theater is what a demo is for, so calling it dishonest misses the point. The mistake is treating it as evidence. A demo is a sales artifact tuned to its best conditions. A bake-off replaces the vendor’s curated conditions with your real ones, runs every vendor through the same test suite, and grades on data instead of vibes.
The second failure mode, subjectivity winning over data, is the single most expensive, and it usually arrives late. The team runs a careful evaluation, the metrics point clearly at one vendor, and then a senior leader dials in to one call, dislikes the voice, and overrides the data. Months of structured work lose to a thirty-second gut reaction.
The fix is to keep judgment but weight it: decide on your criteria before anyone hears a vendor, naturalness and brand fit included, and make the leader’s preference one weighted input rather than a veto applied after the fact. When “the voice feels off” is a criterion you scored from the start, it competes fairly with accuracy and task completion. When it arrives as an afterthought, it quietly outranks everything you measured.
The third failure mode is the invisible long tail. A real evaluation generates thousands of conversations across simulated customers, scenarios, and iterations. Nobody can review ten thousand calls by hand. So teams sample a few dozen, form an impression, and ship. The failures that live in the unreviewed long tail, such as the rare order or the load spike during a promotion, never get seen until they are live and a customer is on the line.
A bake-off grades every call automatically. The failures that a manual spot-check would miss cluster into patterns you can name and fix.
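To make "cluster into patterns" concrete, here is a minimal sketch of grouping graded calls by scenario and failure tag. The data shape, the scenario names, and the failure tags are all hypothetical; a real evaluation platform would supply its own result format.

```python
from collections import Counter
from typing import Optional

# Hypothetical graded results: (scenario, failure tag), with None for a passing call.
graded_calls = [
    ("four_item_family_order", "wrong_item"),
    ("four_item_family_order", "wrong_item"),
    ("mid_order_correction", "skipped_readback"),
    ("single_combo", None),
    ("single_combo", None),
    ("four_item_family_order", "wrong_item"),
]


def failure_clusters(
    results: list[tuple[str, Optional[str]]],
) -> list[tuple[tuple[str, str], int]]:
    """Count failures per (scenario, failure tag), largest cluster first."""
    counts = Counter(
        (scenario, tag) for scenario, tag in results if tag is not None
    )
    return counts.most_common()


clusters = failure_clusters(graded_calls)
```

The largest cluster is the first thing to raise with a vendor, and because every call was graded rather than sampled, the ranking reflects the whole run.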
The adversarial cases hide in that same long tail. An adversarial or red-team test suite probes for the calls where someone is gaming the agent on purpose: a caller working the refund flow to extract a credit they are not owed, or coaxing the agent into disclosing account information it should never reveal. These are exactly the conversations a sample of a few dozen calls will never surface, because they are rare by design and damaging when they land. Coval’s three-layer testing framework names adversarial testing as its own layer alongside regression and production-derived testing, and a bake-off that skips it ships a vendor whose only proof against attack is that nobody tried during the demo.
This is where the demo-to-production gap is born. The conditions a small manual review misses are exactly the conditions production exposes at scale. The only way to see the long tail is to grade all of it automatically, against consistent metrics, the way Waymo grades simulated miles. That is the work a single QA lead cannot do by hand, and it is the work an evaluation platform exists to do.
The four shapes of a vendor bake-off
Not every bake-off has the same shape. The shape determines your scoring weights, your timeline, and your decision rule. Naming the shape up front keeps the evaluation honest, because each shape has a different way of going wrong. Vendors below are referred to generically: no real platform or customer is named.
The most common error across all four shapes is failing to name the shape, so the team applies the wrong decision rule. A head-to-head run as if it were an RFP wastes weeks on process. An incumbent displacement run as a clean head-to-head forgets to count the migration cost. Pick the shape, then pick the rule.
Bake-offs often run in rounds. When the starting field is wide, five or more vendors, you rarely run the full deep evaluation on all of them at once. A first round puts every vendor through a lighter version of the shared test suite to cull the field down to two or three finalists, then a deeper round runs the full suite, the full iteration counts, and the edge-case scenarios on just those finalists. The early round protects you from spending your hardest scenarios on vendors who were never going to make the cut.
A multi-round structure also unlocks a scorecard dimension a single-pass bake-off misses: between rounds, give the finalists their scores and a chance to improve the agent on your feedback, then score how they responded. Responsiveness to issues and the pace of improvement on live calls are among the strongest predictors of what the relationship will be like after you sign, and they are invisible in a one-shot test. A vendor who closes three of your five flagged failures in a week is telling you something a static scorecard cannot.
The method: how to run a defensible bake-off
A defensible bake-off is seven steps. The order matters: each step removes a way the result could later be dismissed.
Step 1. Define what “good” means before you hear a single vendor
Write down your criteria and weight them before any demo. The criteria that matter for most voice agents:
- Task completion — does the agent actually finish the job
- Accuracy — are the facts, prices, and account details right
- Latency — response speed, including time to first audio
- Escalation handling — does it hand off cleanly when it should
- Compliance — required disclosures, identity verification, consent
- Naturalness and brand fit — does it sound like you want to sound
- Cost — the per-minute or per-call economics at your volume
Weight them to your business. A healthcare intake agent weights compliance and accuracy heavily; a drive-thru agent weights task completion and latency. The point of doing this first is that no vendor’s demo can move the goalposts. When a vendor dazzles on a dimension you weighted low, the weight protects you from being talked into reweighting on the spot.
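One way to keep the goalposts fixed is to write the weights down as data before any demo. The sketch below assumes weights normalized to sum to 1; the criterion names follow the list above, and the numbers are hypothetical values for a drive-thru agent, not a recommendation.

```python
import math

# Hypothetical weights for a drive-thru agent; the numbers are illustrative only.
CRITERIA_WEIGHTS = {
    "task_completion": 0.25,
    "accuracy": 0.20,
    "latency": 0.20,
    "escalation_handling": 0.10,
    "compliance": 0.10,
    "naturalness_brand_fit": 0.10,
    "cost": 0.05,
}


def validate_weights(weights: dict[str, float]) -> None:
    """Raise ValueError unless every weight is positive and they sum to 1."""
    if any(w <= 0 for w in weights.values()):
        raise ValueError("every criterion needs a positive weight")
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights sum to {total}, expected 1.0")


validate_weights(CRITERIA_WEIGHTS)
```

Forcing the weights to sum to 1 means that raising one criterion visibly lowers another, which is the tradeoff conversation you want to have before the pitches start.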
Step 2. Mine your production calls to ground the test suite in reality
This is the step that separates a real test from an imagined one. If you already run an agent, or a human contact center, do not brainstorm your test set. Derive it. The recommended move is to ingest a representative sample of production calls, on the order of 1,000 to 10,000, and categorize them to extract three things:
- A use-case and scenario taxonomy. What callers actually call about, frequency-weighted. This tells you what to test and how to weight it: the common cases that make up most of your volume, and the risky long tail that makes up most of your failures. It surfaces scenarios you would never have thought to write down.
- An agent-behavior taxonomy. Where agents actually break: bailing early, hallucinating a policy, skipping a readback, mishandling a transfer, getting stuck in a loop. These become your metrics and your ALWAYS / NEVER / SOMETIMES expected behaviors, and they are exactly the long-tail failure modes a manual review of a few dozen calls will never catch.
- Persona realism. The real distribution of your callers, so your simulated customers mirror your callers instead of a clean studio recording: accents and languages, background environments (a drive-thru lane, a quiet office, a car in traffic, a noisy cafe), emotional states, interruption habits, speaking tempo, and age.
The output is a frequency-weighted scenario and persona matrix that becomes the shared test suite every vendor runs. Now the bake-off measures vendors against your reality, not a vendor’s curated demo, and it is a test suite no vendor can pre-game because it came from your calls.
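As a rough illustration of the frequency weighting, the sketch below turns scenario labels on mined calls into volume shares. It assumes the calls have already been categorized; the label names and counts are hypothetical.

```python
from collections import Counter

# Hypothetical labels: each mined call has already been tagged with a scenario.
labeled_calls = [
    "simple_order", "simple_order", "simple_order", "simple_order",
    "simple_order", "simple_order", "combo_customization",
    "combo_customization", "mid_order_correction", "coupon_validation",
]


def scenario_weights(labels: list[str]) -> dict[str, float]:
    """Return each scenario's share of call volume, most common first."""
    if not labels:
        raise ValueError("need at least one labeled call")
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.most_common()}


weights = scenario_weights(labeled_calls)
```

Pure volume share under-samples the risky long tail, so in practice you would likely add a minimum number of runs per scenario on top of these shares. That floor is a design choice, not something the shares give you.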
Doing this by hand across thousands of calls is the herculean review effort buyers dread, and it is where most bake-offs quietly give up and fall back to vibes. This is where an evaluation platform such as Coval earns its place: it automates the ingest-and-categorize step and turns the call corpus into a reusable test suite of test sets and personas. That is how the neutral-arbiter approach becomes systematic instead of a one-off heroic project. Coval is a leading example of the category, not the only way to do it, but automating this step is the difference between a test suite you can rebuild every quarter and one you build once and never touch again.
Greenfield fallback. If you have no production data yet, synthesize the test suite from domain expertise, the common caller archetypes for your use case, and industry patterns. Then backfill from real calls the moment your agent is live. A synthesized test suite is a starting point; a mined one is the goal.
Step 3. Build one shared test suite
Turn the matrix into concrete simulated customers and scenarios. As defined up front, a persona is the how of a simulated customer (impatient, interrupts often, speaks slowly with filler words, calls from a restaurant), and a test set is the what they are trying to do (“call to dispute a charge on order 12345”). Keeping them separate lets you run one scenario across many caller types. Include the three categories explicitly:
- ALWAYS behaviors: things the agent must do every time (verify identity before disclosing account details, read back the order before confirming).
- NEVER behaviors: things the agent must never do (quote a price it cannot honor, proceed without consent on a recorded line).
- SOMETIMES behaviors: context-dependent calls (escalate to a human when the caller is distressed, offer an upsell only when the order is complete).
The fairness rule is non-negotiable: every vendor runs through identical simulated customers, identical test sets, identical metrics, and identical iteration counts. The moment one vendor gets an easier test suite, the comparison is dead.
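The fairness rule can be expressed as a cross product: every vendor is paired with every persona, every scenario, and every iteration. This is a minimal sketch with hypothetical class and field names, not the data model of any particular platform.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Persona:
    """The 'how' of a simulated customer."""
    name: str
    traits: tuple[str, ...]


@dataclass(frozen=True)
class Scenario:
    """The 'what': the caller's goal plus the behaviors expected of the agent."""
    goal: str
    always: tuple[str, ...] = ()
    never: tuple[str, ...] = ()
    sometimes: tuple[str, ...] = ()


personas = [
    Persona("impatient_regular", ("interrupts often",)),
    Persona("slow_speaker", ("speaks slowly", "filler words")),
]
scenarios = [
    Scenario(
        goal="call to dispute a charge on order 12345",
        always=("verify identity before disclosing account details",),
        never=("quote a price it cannot honor",),
        sometimes=("escalate to a human when the caller is distressed",),
    ),
]
vendors = ["vendor_a", "vendor_b", "vendor_c"]
ITERATIONS = 5  # identical for every vendor

# Every vendor gets the same personas, scenarios, and iteration count.
run_plan = list(product(vendors, personas, scenarios, range(ITERATIONS)))
```

Because the plan is generated rather than hand-assembled, no vendor can end up with an easier subset by accident.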
Step 4. Translate requirements into metrics
Criteria are opinions until they are metrics, and a common mistake is to picture a metric as a single LLM judge stamping pass or fail on a transcript. Coval scores on five families of metrics, and the spread is what lets a scorecard capture the real tradeoffs between vendors instead of one blurry number.
- Deterministic metrics run on rules with no model inference, so they are fast, cheap, and exact. They cover the hard compliance checks a bake-off lives on: did the agent reach the required end state, match the expected output, invoke the right tool with the right parameters, or hit a regex for an exact required phrase. They also catch basic failures like the agent never responding or needing to be reprompted.
- Statistical metrics measure the signal-level reality a demo hides: latency, time to first audio, interruption rate, speech tempo, loop detection, background noise, and voice-quality signals like pitch variability and vocal fry. This is where the agent that sounded great on stage reveals that it talks over callers or stalls under load.
- ML-model metrics use purpose-built models for what rules cannot judge: transcript and audio sentiment, transcription error rate, and timbre drift across a call.
- LLM-judge metrics evaluate meaning. They come as binary (a yes/no behavior, written with explicit YES and NO conditions so two graders agree), numerical (a scored scale like empathy or technical accuracy), categorical (bucketing a call as resolved, escalated, or abandoned), and composite (one verdict built from several expected behaviors). Each has an audio variant that judges the recording itself, for qualities you cannot read off a transcript like vocal tone, clarity, and whether the agent talked over the caller.
- Trace metrics read the agent’s own execution spans, exposing how it ran underneath the words: tool-call counts, LLM token usage, word error rate, and component-level time to first byte across the model, the speech-to-text, and the text-to-speech. This is how you tell whether one vendor is slow because of its model, its transcription, or its speech synthesis, which matters when the platforms you are comparing are built on different stacks.
Two capabilities sit on top. Workflow verification checks the transcript against the call flow you defined, so you can score whether each vendor followed the required steps in order rather than only whether it ended in the right place. And because LLM judges are not perfectly consistent out of the box, the metric-improvement workflow runs a draft metric across your transcripts, shows how often it returns each result, and lets you tighten it against the cases where graders disagree before you trust it to rank vendors. You can define custom metrics in any of these families against your own criteria. Calibrate the judge first: a metric two reviewers would score differently cannot rank two vendors.
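Two of the ideas above are simple enough to sketch directly: a deterministic regex check for a required phrase, and a grader-agreement rate for calibrating a judge. Both functions are hypothetical illustrations, not Coval's API, and the disclosure wording is an assumed example.

```python
import re

# Hypothetical required disclosure; substitute your own compliance phrase.
REQUIRED_DISCLOSURE = re.compile(
    r"this call (may be|is being) recorded", re.IGNORECASE
)


def disclosure_given(agent_turns: list[str]) -> bool:
    """Deterministic check: did any agent turn contain the required phrase?"""
    return any(REQUIRED_DISCLOSURE.search(turn) for turn in agent_turns)


def agreement_rate(labels_a: list[bool], labels_b: list[bool]) -> float:
    """Fraction of calls on which two graders gave the same verdict."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)
```

The first function needs no model inference, which is why checks like it are cheap to run on every call. The second is a crude calibration signal: if two human reviewers, or a reviewer and the judge, agree on too few calls, the metric definition needs tightening before it ranks anyone.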
Step 5. Run every vendor under identical conditions
Same test suite, same metrics, same iteration counts, run in parallel. This is the step the demo skips and the step an evaluation platform makes trivial: pointing the same simulated customers and test sets at three different agent endpoints and scoring all of it with the same judges. Running each vendor a single time tells you almost nothing; voice agents are stochastic, and one good or bad call is noise. Running each vendor across many iterations of the same test suite turns noise into a distribution you can compare.
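To see why iteration count matters, it helps to put an uncertainty band around each pass rate. The sketch below uses the Wilson score interval, a standard choice for binomial proportions; it is offered as one reasonable option, and the 1.96 default assumes an approximate 95 percent interval.

```python
import math


def wilson_interval(
    passes: int, runs: int, z: float = 1.96
) -> tuple[float, float]:
    """Approximate Wilson score interval for a pass rate."""
    if runs <= 0 or not 0 <= passes <= runs:
        raise ValueError("need runs > 0 and 0 <= passes <= runs")
    p = passes / runs
    denom = 1 + z**2 / runs
    centre = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return centre - half, centre + half


# Same 90 percent observed pass rate, very different certainty.
few_runs = wilson_interval(9, 10)
many_runs = wilson_interval(45, 50)
```

The interval from 50 runs is narrower than the one from 10, so two vendors whose intervals overlap heavily at low iteration counts may separate cleanly once you run more.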
Step 6. Score honestly
Here is the truth every real bake-off lands on: no single vendor wins everything. The value of the exercise is not crowning a winner. It is seeing the tradeoffs clearly enough to make a weighted decision. One vendor is the most accurate but drops complex orders. Another delivers the best caller experience but quotes the wrong price. A third completes the task and gets the facts right but sounds robotic and skips the confirmation step. The scorecard makes those tradeoffs legible. It does not make the decision for you.
When the bake-off runs in rounds, the scorecard also carries the responsiveness dimension from earlier: how many of your flagged failures each finalist fixed between rounds, and how fast. That row often separates two vendors who looked even on the static metrics.
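A scorecard rolls up into a weighted total per vendor while the per-criterion rows stay visible. The sketch below assumes each criterion has been normalized to a 0 to 1 score; the weights and scores are hypothetical and do not describe any real vendor.

```python
# Hypothetical weights and normalized scores (0 to 1); not real measurements.
weights = {"accuracy": 0.4, "latency": 0.2, "compliance": 0.3, "experience": 0.1}
scores = {
    "vendor_a": {"accuracy": 0.9, "latency": 0.7, "compliance": 0.8, "experience": 0.6},
    "vendor_b": {"accuracy": 0.6, "latency": 0.8, "compliance": 0.5, "experience": 0.9},
    "vendor_c": {"accuracy": 0.8, "latency": 0.8, "compliance": 0.6, "experience": 0.5},
}


def weighted_total(
    vendor_scores: dict[str, float], weights: dict[str, float]
) -> float:
    """Sum of weight * score over every weighted criterion."""
    missing = weights.keys() - vendor_scores.keys()
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    return sum(weights[c] * vendor_scores[c] for c in weights)


totals = {vendor: weighted_total(s, weights) for vendor, s in scores.items()}
```

The total is a summary, not a verdict. A vendor can post a respectable total while failing a criterion you consider disqualifying, so keep the per-criterion rows next to it.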
A note on what not to publish: a bake-off scorecard is yours, built on your test suite, and it is honest only for your weights and your conditions. Per-vendor numbers from someone else’s bake-off are not transferable, so treat any cross-vendor performance table you see in the wild with suspicion unless you can see the test suite behind it.
Step 7. Decide
Translate the scorecard into a decision with a rule that matches your shape:
- Clear winner: one vendor leads on the heavily weighted criteria. Sign.
- Split decision: vendors trade wins across criteria. Apply your weights and pick; the weights you set in Step 1 do the work here, which is exactly why you set them first.
- Pilot both: two vendors are close and the remaining uncertainty is about real traffic. Run a limited production pilot and decide on live data, scored with the same metrics.
- None qualify: no vendor clears your bar. This is a real outcome. It points you toward a build-vs-buy conversation or a revised set of requirements, not a forced signature.
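The outcomes above can be written as a small rule so the decision is mechanical once the weighted totals exist. This is a sketch under stated assumptions: `bar` (the minimum acceptable total) and `margin` (the lead that counts as clear) are hypothetical thresholds you would set yourself. The split decision is not a separate branch, because applying the weights is what produces the totals in the first place.

```python
def decide(totals: dict[str, float], bar: float, margin: float) -> str:
    """Map weighted totals to an outcome: sign, pilot both, or none qualify."""
    qualified = {v: t for v, t in totals.items() if t >= bar}
    if not qualified:
        return "none qualify"
    ranked = sorted(qualified.items(), key=lambda kv: kv[1], reverse=True)
    # A lone qualifier, or a lead larger than the margin, is a clear winner.
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] > margin:
        return f"sign {ranked[0][0]}"
    # Otherwise the top two are too close to call without live traffic.
    return f"pilot {ranked[0][0]} and {ranked[1][0]}"
```

Fixing `bar` and `margin` before the run serves the same purpose as fixing the weights: the rule cannot be bent afterward to fit a favorite.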
What this looks like in practice: a drive-thru bake-off
A composite walkthrough makes the method concrete. The vendors are A, B, and C; the numbers below are illustrative of how a scorecard reads, not a measurement of any real platform.
A quick-service restaurant chain wants a voice agent for its drive-thru lane. The team starts with five vendors, so it runs a first round on a lighter test suite to cull the field to three finalists, then runs the full evaluation below on those three. To build the test suite, the team mines a sample of recorded lane orders and finds three things:
- Scenario taxonomy. Simple orders dominate volume, but combo customizations, item substitutions, and “actually, change that” mid-order corrections are where money and patience are lost.
- Behavior taxonomy. Human staff and the current system fail most on multi-item modifications and on confirming the final total.
- Persona realism. Heavy background noise, a wide accent range, impatient callers, frequent interruptions, and engine noise on every call.
That becomes the shared test suite. Simulated customers (personas): the impatient regular who interrupts, the first-time caller who needs the menu explained, the non-native speaker, the caller in a loud truck. Test sets: a base-case single combo, a four-item family order with two substitutions, a mid-order correction, an item not on the menu, and a coupon the agent has to validate. Metrics: order accuracy (binary against the POS), total-confirmation readback (binary, an ALWAYS behavior), task completion (categorical), latency and time-to-first-audio (statistical, measured from the audio), and naturalness (audio judge). Every vendor runs the identical test suite across many iterations.
The scorecard comes back like this.
No row crowns a single vendor. A looks accurate in aggregate but cannot be trusted on the complex orders that drive margin: expand it and the four-item family order and the mid-order correction both fail. B delivers the experience customers will love but quotes prices the chain cannot honor, which is a legal and trust problem, not a polish problem. C is dependable on the numbers but skips a required readback and sounds like a machine, though it was the fastest to close the issues the team flagged between rounds, which says something about the relationship to come. The decision now depends entirely on the weights set in Step 1. If price discipline and complex-order accuracy are weighted highest, B is out despite the best demo, and the choice is between A and C with a remediation plan for each, with C’s pace of improvement counting in its favor. The scorecard made the tradeoffs impossible to ignore rather than picking the winner for the team, which is the whole point.
Pitfalls that quietly ruin a bake-off
Even a well-run bake-off can be undermined. Watch for these.
- Gaming the test set. If a vendor sees the test suite in advance, they tune to it and the result is theater again. Keep the test suite private until the run, and refresh it from new production calls each cycle so it cannot be memorized.
- Weighting theater. Setting criteria weights after seeing the results, or quietly reweighting to justify a favored vendor. The weights have to be locked before the run and changed only with a documented reason.
- Reverting to vibes anyway. Running the whole rigorous process and then letting one bad call from a senior stakeholder override it. This is failure mode 2 from earlier, and the antidote is the same: name naturalness as a weighted criterion up front so taste competes fairly instead of vetoing late.
- Moving too slowly. A bake-off that takes a quarter loses the window. The cautionary case is the team that could not run two evaluations in parallel, dragged the process across months, and lost the deployment slot entirely. Speed is a feature of the method: parallel runs and automated scoring are what keep a rigorous bake-off from collapsing under its own weight.
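For the weighting-theater pitfall, one lightweight guard is to record a fingerprint of the weights before the run and compare it afterward. This sketch uses a SHA-256 hash over a canonical JSON encoding; the function name is hypothetical, and the approach is one option among several (a signed-off document works too).

```python
import hashlib
import json


def weights_fingerprint(weights: dict[str, float]) -> str:
    """SHA-256 of the weights in a canonical JSON form (sorted keys)."""
    canonical = json.dumps(weights, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Record this before the run; the weights here are hypothetical.
locked = weights_fingerprint({"accuracy": 0.5, "latency": 0.3, "cost": 0.2})
```

If the fingerprint computed at decision time differs from the one recorded before the run, the weights changed, and the change needs a documented reason.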
What happens after you select a vendor
The most underrated benefit of running a rigorous bake-off only shows up after you sign: the work does not get thrown away. The same metrics and test sets you built to choose a vendor are the ones you want pointed at that vendor in production. The moment the chosen agent goes live, the selection test suite becomes your regression suite and your production-monitoring suite. Every metric you wrote to score the bake-off keeps scoring real calls. Every persona and scenario you mined becomes a regression case you re-run before each release, so a fix for one caller does not quietly break another.
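The hand-off from selection to regression can be sketched as a comparison between the pass rates recorded at selection and the ones measured on a later release. The metric names, the pass-rate representation, and the default tolerance are all assumptions made for illustration.

```python
def regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return the change for each metric that fell more than `tolerance`."""
    return {
        name: current[name] - baseline[name]
        for name in baseline
        if name in current and baseline[name] - current[name] > tolerance
    }


# Hypothetical pass rates: selection-time baseline versus a later release.
baseline = {"total_readback": 0.95, "order_accuracy": 0.90}
current = {"total_readback": 0.85, "order_accuracy": 0.90}
flagged = regressions(baseline, current)
```

Here only `total_readback` is flagged, with a negative change. The quality bar is the same one used to pick the vendor, so a regression is judged against selection-time behavior rather than a new, looser standard.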
This is where an independent evaluation platform pays off twice. Because the platform is the neutral layer and not the vendor, the evaluation is vendor- and model-agnostic: if you swap models later, or change vendors entirely, the same test suite and the same metrics still apply, so you are not rebuilding your quality bar from zero each time. And because the platform does monitoring and human review alongside simulation, the bake-off rolls straight into production observability: Coval keeps running your metrics on live conversations, flags regressions against the same bar you set during selection, and routes the calls that need a person to a human-review queue. The selection exercise and the ongoing quality program are the same artifact, which is the opposite of the typical pattern where a buyer builds an elaborate evaluation, picks a vendor, and then starts over from scratch on monitoring. The three-layer testing framework covers how the same scenarios serve selection and then regression, adversarial, and production-derived testing once you are live.
Where to go from here
A defensible voice agent vendor bake-off is not complicated, but it is real work: name the shape, weight the criteria before you listen to a pitch, mine your calls into a test suite, turn requirements into metrics, run every vendor identically (in rounds if the field is wide), score honestly, and decide against the weights you set first. Done this way, the decision survives a skeptical room because every step removed a way it could be dismissed, and the test suite you built keeps working for you after you sign.
The part most teams underestimate is the volume. Mining thousands of calls and grading thousands of conversations across vendors is exactly the work a single reviewer cannot do by hand and exactly the work an evaluation platform exists to do. Coval is the neutral measurement layer that sits between the buyer and the vendors: it ingests the calls, builds the test suite, runs every vendor through identical simulations, grades them with consistent metrics, and routes the calls that need a person to a human review queue. It is the instrument that makes the bake-off objective, not one of the contestants.
If you want the methodology behind the metrics, read the voice AI agent evaluation guide. If you are deciding whether to build your own evaluation harness or buy one for the bake-off, the build vs. buy framework covers that tradeoff. To understand why the demo and the deployment diverge, see why voice AI agents break in production and the five metrics that predict production success. For the persona-realism piece, accents, dialects, and multilingual testing covers the blind spot most test suites miss, and the three-layer testing framework shows how a bake-off test suite becomes ongoing regression coverage after you sign.