Voice Agent Vendor Testing: How to Run a Bake-Off

How to run a voice AI vendor bake-off that picks the vendor your callers actually experience: the failure modes, the four shapes, and a defensible seven-step method.

Voice agent vendor testing is the process of evaluating competing voice AI platforms against the same callers, the same scenarios, and the same scoring rubric. The goal is to find the vendor that creates the best experience for your customers, not the vendor with the best demo. Most teams skip this. They pick a vendor off a curated 20-minute demo, sign, and discover in production that the agent they bought is not the agent they saw. The stakes are real: a voice agent answers your phones, takes your orders, verifies your callers, and speaks in your brand voice to thousands of people you never hear from. Choosing the wrong one is expensive to unwind and slow to detect.

Key takeaways
  • A vendor bake-off compares platforms under identical conditions: the same simulated customers, the same test suite, the same metrics, the same iteration count. Anything less is a demo, not a test.
  • Three failure modes pick the wrong vendor: demo theater, subjectivity winning over data, and the invisible long tail of conversations nobody can review by hand, including the adversarial calls a red-team test suite would have caught.
  • The most reliable test suite is not imagined. It is mined from your own production or contact-center calls, then turned into a frequency-weighted scenario and persona matrix no vendor can pre-game.
  • No single vendor wins everything. The value of a bake-off is seeing the tradeoffs clearly, then deciding against criteria you weighted before you heard a single pitch.
  • An independent evaluation platform, like Coval, is the neutral measurement layer. It future-proofs you with vendor- and model-agnostic evaluations, and the test suite you build rolls forward into production monitoring once your chosen vendor is live. It is the instrument, not a contestant.

This guide is written for the person who owns the decision: the head of product, eval lead, or engineering leader who will sponsor or run a voice AI vendor evaluation. It draws on how rigorous bake-offs actually run across contact centers, restaurants, fintech, healthcare, and logistics, and on what Coval has seen from its position between buyers and vendors as the neutral measurement layer. The method is vendor-agnostic. The point is to make the comparison fair, fast, and defensible enough that the decision survives a skeptical room.

Want the whole method in one place? Download the playbook PDF and the Claude Code skill to scope and run a bake-off yourself, without reading another word.

What a voice agent vendor bake-off actually is

Definition
Voice agent vendor bake-off: a structured evaluation that runs two or more voice AI platforms through one shared test suite of simulated customers, scenarios, and metrics under identical conditions, so the platforms are scored on the same evidence and the decision is reproducible.

Three ingredients run through the whole method, so it helps to define them before anything else. Simulated customers are the callers you generate to test each vendor: synthetic people who behave the way your real callers do. In Coval’s model these are personas, and a persona is the how of a simulated customer, the behavior pattern they bring to the call: their accents, languages, interruption habits, emotional state, speaking tempo, and environment (a drive-thru lane, a quiet office, a car in traffic). The test set is the what: what each simulated customer is trying to accomplish on the call (“call to dispute a charge on order 12345”). Keeping the two separate lets you run one scenario across many caller types. The third ingredient is the metrics: how each call is graded and evaluated once it runs. Together, the simulated customers (the how), the test sets (the what), and the metrics (the grade) make up the shared test suite that every vendor runs. That same test suite is reusable: the one you build to pick a vendor is the one you roll forward into production monitoring once the vendor is live, so the work compounds instead of being thrown away. More on that below.

Persona (the how). How the simulated customer behaves: accent, interruptions, emotion, environment.
+
Test set (the what). What they are trying to do on the call: "dispute a charge on order 12345."
+
Metrics (the grade). How each call is scored and evaluated once it runs.
=
Test suite (what every vendor runs). Identical for all vendors, and reused in production.
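
The split between persona, test set, and metrics can be sketched as plain data. This is a minimal illustration, not Coval's API: the class names, the fields, the example values, and the build_suite helper are all hypothetical. It shows why keeping the how and the what separate pays off: every persona is paired with every scenario, and every pairing is graded by the same metrics.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Persona:
    """The how: the behavior a simulated customer brings to the call."""
    name: str
    interrupts: bool
    environment: str


@dataclass(frozen=True)
class Scenario:
    """The what: the goal the simulated customer is pursuing."""
    goal: str


def build_suite(personas, scenarios, metrics):
    """Pair every persona with every scenario; each pairing gets the same metrics."""
    return [
        {"persona": p, "scenario": s, "metrics": tuple(metrics)}
        for p, s in product(personas, scenarios)
    ]


# Hypothetical example values.
personas = [
    Persona("impatient regular", interrupts=True, environment="car in traffic"),
    Persona("first-time caller", interrupts=False, environment="quiet office"),
]
scenarios = [
    Scenario("dispute a charge on order 12345"),
    Scenario("change a delivery address"),
]
metrics = ["task_completion", "identity_verified", "time_to_first_audio"]

suite = build_suite(personas, scenarios, metrics)
print(len(suite))  # 2 personas x 2 scenarios = 4 simulated calls
```

Because the suite is built once and then handed unchanged to each vendor, the fairness of the comparison rests on the data and not on anyone's memory of what was tested.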

The framing that helps most: you would never pick a self-driving stack off a ten-minute ride. Waymo earned trust by running its software across millions of simulated miles covering rare and dangerous edge cases before a car ever carried a passenger on those routes, not by giving polished demos. According to Waymo’s writing on its simulation infrastructure, the company ships an update because it has passed exhaustive simulation, not because it worked once on a test drive. Voice AI is the same problem in a different medium. A voice agent meets the same long tail of accents, interruptions, background noise, and edge-case requests that a self-driving car meets in traffic. Evaluating a vendor off a demo is the ride-along. A bake-off is the simulation.

The difference matters because the demo and the deployment are rarely the same agent. Coval’s own observation across voice AI teams is blunt: roughly 95 percent of agents handle their demo flawlessly, and only about 62 percent survive the first week in production. That gap is Coval’s internal observation from working with voice teams, not a market statistic, and it is the entire reason a bake-off exists. The demo measures the base case under ideal conditions. Production measures the long tail under real ones. A vendor bake-off is how you move the measurement from the first to the second before you sign.

Why most vendor bake-offs pick the wrong vendor

Three failure modes show up again and again. Each one is a different way of measuring the demo instead of the deployment.

Failure mode 1: Demo theater

Vendors demo on cherry-picked base cases. The script is clean, the caller is cooperative, the audio is studio-quality, and the request is one the agent was tuned to handle. The buyer grades on impression: did the conversation feel smooth, did the voice sound good, did it do the thing on stage. None of that predicts what happens when a caller with a thick accent interrupts mid-sentence from a moving car.

Demo theater is what a demo is for, so calling it dishonest misses the point. The mistake is treating it as evidence. A demo is a sales artifact tuned to its best conditions. A bake-off replaces the vendor’s curated conditions with your real ones, runs every vendor through the same test suite, and grades on data instead of vibes.

Failure mode 2: Subjectivity wins

This is the single most expensive failure mode, and it usually arrives late. The team runs a careful evaluation, the metrics point clearly at one vendor, and then a senior leader dials in to one call, dislikes the voice, and overrides the data. Months of structured work lose to a thirty-second gut reaction.

The fix is to keep judgment but weight it: decide on your criteria before anyone hears a vendor, naturalness and brand fit included, and make the leader’s preference one weighted input rather than a veto applied after the fact. When “the voice feels off” is a criterion you scored from the start, it competes fairly with accuracy and task completion. When it arrives as an afterthought, it quietly outranks everything you measured.

Failure mode 3: The invisible 10,000 conversations

A real evaluation generates thousands of conversations across simulated customers, scenarios, and iterations. Nobody can review ten thousand calls by hand, so teams sample a few dozen, form an impression, and ship. The failures that live in the unreviewed long tail (the rare order, the load spike during a promotion) stay unseen until they are live and a customer is on the line.

[Figure: roughly 10,000 graded calls, clustered into named patterns such as "dropped readback" and "latency spike under load."]

A bake-off grades every call automatically. The failures that a manual spot-check would miss cluster into patterns you can name and fix.

The adversarial cases hide in that same long tail. An adversarial or red-team test suite probes for the calls where someone is gaming the agent on purpose: a caller working the refund flow to extract a credit they are not owed, or coaxing the agent into disclosing account information it should never reveal. These are exactly the conversations a sample of a few dozen calls will never surface, because they are rare by design and damaging when they land. Coval’s three-layer testing framework names adversarial testing as its own layer alongside regression and production-derived testing, and a bake-off that skips it ships a vendor whose only proof against attack is that nobody tried during the demo.

This is where the demo-to-production gap is born. The conditions a small manual review misses are exactly the conditions production exposes at scale. The only way to see the long tail is to grade all of it automatically, against consistent metrics, the way Waymo grades simulated miles. That is the work a single QA lead cannot do by hand, and it is the work an evaluation platform exists to do.
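
Once every call is graded, naming the patterns is a counting exercise. The sketch below assumes a hypothetical record format (one dict per graded call with a list of failed metric names); it is not the output format of any particular platform.

```python
from collections import Counter

# Hypothetical graded results: one record per simulated call.
graded_calls = [
    {"call_id": 1, "failed_metrics": []},
    {"call_id": 2, "failed_metrics": ["readback"]},
    {"call_id": 3, "failed_metrics": ["readback", "latency"]},
    {"call_id": 4, "failed_metrics": []},
    {"call_id": 5, "failed_metrics": ["latency"]},
    {"call_id": 6, "failed_metrics": ["readback"]},
]


def failure_patterns(calls):
    """Count how many calls failed each metric, most common first."""
    counts = Counter(m for call in calls for m in call["failed_metrics"])
    return counts.most_common()


patterns = failure_patterns(graded_calls)
print(patterns)  # [('readback', 3), ('latency', 2)]
```

The same tally run over thousands of calls per vendor is what turns "we listened to a few and it seemed fine" into a ranked list of failure patterns.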

The four shapes of a vendor bake-off

Not every bake-off has the same shape. The shape determines your scoring weights, your timeline, and your decision rule. Naming the shape up front keeps the evaluation honest, because each shape has a different way of going wrong. Vendors below are referred to generically: no real platform or customer is named.

Shape 01: Formal enterprise RFP
A multi-phase funnel that narrows a wide field down to a small finalist set, with the technical bake-off as the last phase.
When: Large, regulated, or high-spend deployments where procurement requires a paper trail.
Trap: Scoring the paperwork instead of the agent; the bake-off becomes a rubber stamp on a decision already made on slideware.

Shape 02: Build vs. buy
An internal or incumbent build measured against an external challenger.
When: You already run an agent or a script and want to know whether a vendor beats what you have.
Trap: Grading the build on sunk cost and familiarity rather than on the same test suite the challenger runs.

Shape 03: Head-to-head
Two or three finalists decided on the merits, no incumbent.
When: You have narrowed to two or three credible platforms and need a tiebreaker grounded in evidence.
Trap: Letting a single dimension (latency, price, one good call) settle a decision that should be weighted across criteria.

Shape 04: Incumbent displacement
A challenger must clear a bar set by an entrenched, already-deployed vendor.
When: A live vendor is underperforming and you want to know if switching is worth the migration cost.
Trap: Holding the challenger to a higher bar than the incumbent ever passed; the incumbent's real production failures are the baseline.

The most common error across all four shapes is failing to name the shape, so the team applies the wrong decision rule. A head-to-head run as if it were an RFP wastes weeks on process. An incumbent displacement run as a clean head-to-head forgets to count the migration cost. Pick the shape, then pick the rule.

Bake-offs often run in rounds. When the starting field is wide (five or more vendors), you rarely run the full deep evaluation on all of them at once. A first round puts every vendor through a lighter version of the shared test suite to cull the field down to two or three finalists, then a deeper round runs the full suite, the full iteration counts, and the edge-case scenarios on just those finalists. The early round protects you from spending your hardest scenarios on vendors who were never going to make the cut. A multi-round structure also unlocks a scorecard dimension a single-pass bake-off misses: between rounds, give the finalists their scores and a chance to improve the agent on your feedback, then score how they responded. Responsiveness to issues and the pace of improvement on live calls are among the strongest predictors of what the relationship will be like after you sign, and they are invisible in a one-shot test. A vendor who closes three of your five flagged failures in a week is telling you something a static scorecard cannot.

Want a second set of eyes on your criteria? A Coval solutions engineer can help you weight what matters, mine your calls into a test suite, and run every vendor identically.

The method: how to run a defensible bake-off

A defensible bake-off is seven steps. The order matters: each step removes a way the result could later be dismissed.

Step 1
Define what "good" means first
Write your criteria and weight them before any demo, so no vendor's pitch can move the goalposts later.
Step 2
Mine your production calls
Ingest 1,000 to 10,000 real calls and categorize them into a scenario, behavior, and persona-realism taxonomy, so the test comes from your reality.
Step 3
Build one shared test suite
Turn the matrix into personas and test sets with explicit ALWAYS / NEVER / SOMETIMES behaviors, identical for every vendor.
Step 4
Translate requirements into metrics
Map each criterion onto Coval's five metric families, then calibrate the judges before they are allowed to rank vendors.
Step 5
Run every vendor identically
Same suite, same metrics, same iteration counts, in parallel, so one good or bad call is noise instead of a verdict.
Step 6
Score honestly
No single vendor wins everything; the scorecard exists to make the tradeoffs legible, not to crown a winner for you.
Step 7
Decide
Apply the weights you set in Step 1: clear winner, split decision, pilot both, or none qualify.

Step 1. Define what “good” means before you hear a single vendor

Write down your criteria and weight them before any demo. The criteria that matter for most voice agents:

  • Task completion — does the agent actually finish the job
  • Accuracy — are the facts, prices, and account details right
  • Latency — response speed, including time to first audio
  • Escalation handling — does it hand off cleanly when it should
  • Compliance — required disclosures, identity verification, consent
  • Naturalness and brand fit — does it sound like you want to sound
  • Cost — the per-minute or per-call economics at your volume

Weight them to your business. A healthcare intake agent weights compliance and accuracy heavily; a drive-thru agent weights task completion and latency. The point of doing this first is that no vendor’s demo can move the goalposts. When a vendor dazzles on a dimension you weighted low, the weight protects you from being talked into reweighting on the spot.
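
Locking the weights first is easier to see with a worked example. The sketch below is illustrative only: the weights, the two vendors, and their per-criterion scores are hypothetical, and a real rubric would use your own criteria and scales.

```python
# Hypothetical weights for a drive-thru agent, fixed before any demo. They sum to 1.0.
WEIGHTS = {
    "task_completion": 0.25,
    "accuracy": 0.20,
    "latency": 0.20,
    "escalation": 0.10,
    "compliance": 0.10,
    "naturalness": 0.10,
    "cost": 0.05,
}


def weighted_score(scores, weights=WEIGHTS):
    """Combine per-criterion scores (each 0.0 to 1.0) into one weighted total."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weights[c] * scores[c] for c in weights)


# Hypothetical vendors: one dazzles on naturalness, one is steady on the heavy criteria.
dazzling = {"task_completion": 0.6, "accuracy": 0.7, "latency": 0.6, "escalation": 0.7,
            "compliance": 0.8, "naturalness": 1.0, "cost": 0.5}
steady = {"task_completion": 0.9, "accuracy": 0.9, "latency": 0.8, "escalation": 0.7,
          "compliance": 0.8, "naturalness": 0.6, "cost": 0.5}

print(round(weighted_score(dazzling), 3), round(weighted_score(steady), 3))
```

With these numbers the steady vendor comes out ahead even though the other one has a perfect naturalness score, because naturalness carries only a tenth of the weight. That is the protection the paragraph above describes.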

Step 2. Mine your production calls to ground the test suite in reality

This is the step that separates a real test from an imagined one. If you already run an agent, or a human contact center, do not brainstorm your test set. Derive it. The recommended move is to ingest a representative sample of production calls, on the order of 1,000 to 10,000, and categorize them to extract three things:

  1. A use-case and scenario taxonomy. What callers actually call about, frequency-weighted. This tells you what to test and how to weight it: the common cases that make up most of your volume, and the risky long tail that makes up most of your failures. It surfaces scenarios you would never have thought to write down.
  2. An agent-behavior taxonomy. Where agents actually break: bailing early, hallucinating a policy, skipping a readback, mishandling a transfer, getting stuck in a loop. These become your metrics and your ALWAYS / NEVER / SOMETIMES expected behaviors, and they are exactly the long-tail failure modes a manual review of a few dozen calls will never catch.
  3. Persona realism. The real distribution of your callers, so your simulated customers mirror your callers instead of a clean studio recording: accents and languages, background environments (a drive-thru lane, a quiet office, a car in traffic, a noisy cafe), emotional states, interruption habits, speaking tempo, and age.

The output is a frequency-weighted scenario and persona matrix that becomes the shared test suite every vendor runs. Now the bake-off measures vendors against your reality, not a vendor’s curated demo, and it is a test suite no vendor can pre-game because it came from your calls.
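
A frequency-weighted matrix is a normalized count over labeled calls. The sketch below assumes the categorization step has already produced one (scenario, persona) label pair per call; the labels and the frequency_matrix helper are hypothetical.

```python
from collections import Counter

# Hypothetical output of the categorization step: one (scenario, persona) pair per call.
labeled_calls = [
    ("simple order", "impatient regular"),
    ("simple order", "impatient regular"),
    ("simple order", "non-native speaker"),
    ("mid-order correction", "impatient regular"),
    ("item substitution", "caller in a loud truck"),
]


def frequency_matrix(calls):
    """Return each (scenario, persona) cell's share of total call volume."""
    counts = Counter(calls)
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}


matrix = frequency_matrix(labeled_calls)
for cell, share in sorted(matrix.items(), key=lambda kv: -kv[1]):
    print(cell, share)
```

Raw frequency alone under-weights the rare but risky cells, so a team would likely layer a risk adjustment or a minimum number of runs per cell on top of these shares.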

Doing this by hand across thousands of calls is the herculean review effort buyers dread, and it is where most bake-offs quietly give up and fall back to vibes. This is where an evaluation platform earns its place: a platform like Coval automates the ingest-and-categorize step and turns the call corpus into a reusable test suite of test sets and personas. That is how the neutral-arbiter approach becomes systematic instead of a one-off heroic project. Coval is a leading example of the category, not the only way to do it, but automation is the difference between a test suite you can rebuild every quarter and one you build once and never touch again.

Greenfield fallback. If you have no production data yet, synthesize the test suite from domain expertise, the common caller archetypes for your use case, and industry patterns. Then backfill from real calls the moment your agent is live. A synthesized test suite is a starting point; a mined one is the goal.

Step 3. Build one shared test suite

Turn the matrix into concrete simulated customers and scenarios. As defined up front, a persona is the how of a simulated customer (impatient, interrupts often, speaks slowly with filler words, calls from a restaurant), and a test set is the what: the thing they are trying to do (“call to dispute a charge on order 12345”). Keeping them separate lets you run one scenario across many caller types. Give each scenario three explicit categories of expected behavior:

  • ALWAYS behaviors: things the agent must do every time (verify identity before disclosing account details, read back the order before confirming).
  • NEVER behaviors: things the agent must never do (quote a price it cannot honor, proceed without consent on a recorded line).
  • SOMETIMES behaviors: context-dependent calls (escalate to a human when the caller is distressed, offer an upsell only when the order is complete).

The fairness rule is non-negotiable: every vendor runs through identical simulated customers, identical test sets, identical metrics, and identical iteration counts. The moment one vendor gets an easier test suite, the comparison is dead.
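
The ALWAYS and NEVER categories reduce to set checks once each call has been tagged with the behaviors observed on it. The sketch below assumes such tags exist; the behavior names and the check_call helper are hypothetical. SOMETIMES behaviors depend on context and are left out here, since they usually need a judge and cannot be handled by a set comparison.

```python
# Hypothetical expected behaviors for one scenario.
EXPECTED = {
    "always": {"verify_identity", "read_back_order"},
    "never": {"quote_unhonorable_price", "proceed_without_consent"},
}


def check_call(observed_behaviors, expected=EXPECTED):
    """Return the ALWAYS behaviors that were skipped and the NEVER behaviors that occurred."""
    observed = set(observed_behaviors)
    return {
        "missing_always": sorted(expected["always"] - observed),
        "violated_never": sorted(expected["never"] & observed),
    }


result = check_call(["verify_identity", "quote_unhonorable_price"])
print(result)
# {'missing_always': ['read_back_order'], 'violated_never': ['quote_unhonorable_price']}
```

Running the same check with the same expected sets against every vendor is one concrete form of the fairness rule.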

Step 4. Translate requirements into metrics

Criteria are opinions until they are metrics, and a common mistake is to picture a metric as a single LLM judge stamping pass or fail on a transcript. Coval scores on five families of metrics, and the spread is what lets a scorecard capture the real tradeoffs between vendors instead of one blurry number.

  • Deterministic metrics run on rules with no model inference, so they are fast, cheap, and exact. They cover the hard compliance checks a bake-off lives on: did the agent reach the required end state, match the expected output, invoke the right tool with the right parameters, or hit a regex for an exact required phrase. They also catch basic failures like the agent never responding or needing to be reprompted.
  • Statistical metrics measure the signal-level reality a demo hides: latency, time to first audio, interruption rate, speech tempo, loop detection, background noise, and voice-quality signals like pitch variability and vocal fry. This is where the agent that sounded great on stage reveals that it talks over callers or stalls under load.
  • ML-model metrics use purpose-built models for what rules cannot judge: transcript and audio sentiment, transcription error rate, and timbre drift across a call.
  • LLM-judge metrics evaluate meaning. They come as binary (a yes/no behavior, written with explicit YES and NO conditions so two graders agree), numerical (a scored scale like empathy or technical accuracy), categorical (bucketing a call as resolved, escalated, or abandoned), and composite (one verdict built from several expected behaviors). Each has an audio variant that judges the recording itself, for qualities you cannot read off a transcript like vocal tone, clarity, and whether the agent talked over the caller.
  • Trace metrics read the agent’s own execution spans, exposing how it ran underneath the words: tool-call counts, LLM token usage, word error rate, and component-level time to first byte across the model, the speech-to-text, and the text-to-speech. This is how you tell whether one vendor is slow because of its model, its transcription, or its speech synthesis, which matters when the platforms you are comparing are built on different stacks.
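
As an illustration of the deterministic family, a required-phrase check needs nothing more than a regular expression. The phrase, the pattern, and the disclosure_given helper below are hypothetical; the real wording would come from your compliance requirements.

```python
import re

# Hypothetical required disclosure; the exact wording is an assumption.
DISCLOSURE = re.compile(r"\bthis call (may be|is being) recorded\b", re.IGNORECASE)


def disclosure_given(agent_turns):
    """Deterministic check: did any agent turn contain the required phrase?"""
    return any(DISCLOSURE.search(turn) for turn in agent_turns)


print(disclosure_given(["Hi! This call may be recorded for quality."]))  # True
print(disclosure_given(["Hi, what can I get you?"]))  # False
```

A check like this returns the same answer every time it runs, which is why rule-based metrics suit hard compliance requirements better than a judge's opinion.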

Two capabilities sit on top. Workflow verification checks the transcript against the call flow you defined, so you can score whether each vendor followed the required steps in order rather than only whether it ended in the right place. And because LLM judges are not perfectly consistent out of the box, the metric-improvement workflow runs a draft metric across your transcripts, shows how often it returns each result, and lets you tighten it against the cases where graders disagree before you trust it to rank vendors. You can define custom metrics in any of these families against your own criteria. Calibrate the judge first: a metric two reviewers would score differently cannot rank two vendors.
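
One way to put a number on calibration is an agreement statistic between the judge and a human reviewer on the same calls. The sketch below uses Cohen's kappa, a common chance-corrected agreement measure. It is a swapped-in illustration: the source does not say which measure Coval's metric-improvement workflow uses, and the labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two graders on the same calls, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both graders must label the same non-empty set of calls")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Probability the two graders agree by chance, given how often each uses each label.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both graders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for the same eight calls.
human = ["YES", "YES", "NO", "NO", "YES", "NO", "YES", "NO"]
judge = ["YES", "YES", "NO", "YES", "YES", "NO", "NO", "NO"]
kappa = cohens_kappa(human, judge)
print(kappa)  # 0.5
```

Here the graders agree on six of eight calls, but half of that agreement is expected by chance, so kappa is 0.5. A team would set its own threshold for when a judge is trusted to rank vendors, then tighten the metric's YES and NO conditions on the calls where the labels disagree.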

Step 5. Run every vendor under identical conditions

Same test suite, same metrics, same iteration counts, run in parallel. This is the step the demo skips and the step an evaluation platform makes trivial: pointing the same simulated customers and test sets at three different agent endpoints and scoring all of it with the same judges. Running each vendor a single time tells you almost nothing; voice agents are stochastic, and one good or bad call is noise. Running each vendor across many iterations of the same test suite turns noise into a distribution you can compare.
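
The effect of iteration count can be made concrete with a confidence interval on the pass rate. The sketch below uses the Wilson score interval, a standard interval for a proportion. It is offered as one reasonable choice, not as the method any platform prescribes, and the run counts are hypothetical.

```python
import math


def wilson_interval(passes, runs, z=1.96):
    """Approximate 95% Wilson score interval for a pass rate (z=1.96)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return center - half, center + half


single = wilson_interval(1, 1)    # one run, one pass
many = wilson_interval(45, 50)    # fifty runs, forty-five passes
print(single, many)
```

With one passing run out of one, the interval spans roughly 0.21 to 1.0, so the result says almost nothing. With 45 passes out of 50 it narrows to roughly 0.79 to 0.96, which is tight enough to compare against another vendor's interval on the same scenario.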

Step 6. Score honestly

Here is the truth every real bake-off lands on: no single vendor wins everything. The value of the exercise is not crowning a winner. It is seeing the tradeoffs clearly enough to make a weighted decision. One vendor is the most accurate but drops complex orders. Another delivers the best caller experience but quotes the wrong price. A third completes the task and gets the facts right but sounds robotic and skips the confirmation step. The scorecard makes those tradeoffs legible. It does not make the decision for you.

When the bake-off runs in rounds, the scorecard also carries the responsiveness dimension from earlier: how many of your flagged failures each finalist fixed between rounds, and how fast. That row often separates two vendors who looked even on the static metrics.

A note on what not to publish: a bake-off scorecard is yours, built on your test suite, and it is honest only for your weights and your conditions. Per-vendor numbers from someone else’s bake-off are not transferable, so treat any cross-vendor performance table you see in the wild with suspicion unless you can see the test suite behind it.

Step 7. Decide

Translate the scorecard into a decision with a rule that matches your shape:

  • Clear winner: one vendor leads on the heavily weighted criteria. Sign.
  • Split decision: vendors trade wins across criteria. Apply your weights and pick; the weights you set in Step 1 do the work here, which is exactly why you set them first.
  • Pilot both: two vendors are close and the remaining uncertainty is about real traffic. Run a limited production pilot and decide on live data, scored with the same metrics.
  • None qualify: no vendor clears your bar. This is a real outcome. It points you toward a build-vs-buy conversation or a revised set of requirements, not a forced signature.
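
The four outcomes can be expressed as a small rule applied to the weighted totals. This is a sketch under stated assumptions: the bar and close_margin thresholds are hypothetical values a team would fix in advance, and the clear-winner and split-decision cases both resolve to "pick the top weighted score", because the weights have already settled the tradeoff.

```python
def decide(weighted_scores, bar=0.75, close_margin=0.03):
    """Map weighted vendor scores onto a decision outcome.

    bar and close_margin are hypothetical thresholds set before the run.
    """
    qualified = {v: s for v, s in weighted_scores.items() if s >= bar}
    if not qualified:
        return "none qualify", []
    ranked = sorted(qualified, key=qualified.get, reverse=True)
    if len(ranked) > 1 and qualified[ranked[0]] - qualified[ranked[1]] < close_margin:
        return "pilot both", ranked[:2]
    return "pick", ranked[:1]


print(decide({"A": 0.82, "B": 0.70, "C": 0.74}))  # ('pick', ['A'])
print(decide({"A": 0.80, "B": 0.79}))             # ('pilot both', ['A', 'B'])
print(decide({"A": 0.50}))                        # ('none qualify', [])
```

Writing the rule down before the run serves the same purpose as locking the weights: the outcome follows from the evidence, not from whoever argues last.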

What this looks like in practice: a drive-thru bake-off

A composite walkthrough makes the method concrete. The vendors are A, B, and C; the numbers below are illustrative of how a scorecard reads, not a measurement of any real platform.

A quick-service restaurant chain wants a voice agent for its drive-thru lane. The team starts with five vendors, so it runs a first round on a lighter test suite to cull the field to three finalists, then runs the full evaluation below on those three. To build the test suite, the team mines a sample of recorded lane orders and finds three things:

  • Scenario taxonomy. Simple orders dominate volume, but combo customizations, item substitutions, and “actually, change that” mid-order corrections are where money and patience are lost.
  • Behavior taxonomy. Human staff and the current system fail most on multi-item modifications and on confirming the final total.
  • Persona realism. Heavy background noise, a wide accent range, impatient callers, frequent interruptions, and engine noise on every call.

That becomes the shared test suite. Simulated customers (personas): the impatient regular who interrupts, the first-time caller who needs the menu explained, the non-native speaker, the caller in a loud truck. Test sets: a base-case single combo, a four-item family order with two substitutions, a mid-order correction, an item not on the menu, and a coupon the agent has to validate. Metrics: order accuracy (binary against the POS), total-confirmation readback (binary, an ALWAYS behavior), task completion (categorical), latency and time-to-first-audio (statistical), and naturalness (audio judge). Every vendor runs the identical test suite across many iterations.

The scorecard comes back like this.

No row crowns a single vendor. A looks accurate in aggregate but cannot be trusted on the complex orders that drive margin: break its accuracy down by scenario and both the four-item family order and the mid-order correction fail. B delivers the experience customers will love but quotes prices the chain cannot honor, which is a legal and trust problem, not a polish problem. C is dependable on the numbers but skips a required readback and sounds like a machine, though it was the fastest to close the issues the team flagged between rounds, which says something about the relationship to come. The decision now depends entirely on the weights set in Step 1. If price discipline and complex-order accuracy are weighted highest, B is out despite the best demo, and the choice is between A and C with a remediation plan for each, with C’s pace of improvement counting in its favor. The scorecard made the tradeoffs impossible to ignore rather than picking the winner for the team, which is the whole point.

Pitfalls that quietly ruin a bake-off

Even a well-run bake-off can be undermined. Watch for these.

  • Gaming the test set. If a vendor sees the test suite in advance, they tune to it and the result is theater again. Keep the test suite private until the run, and refresh it from new production calls each cycle so it cannot be memorized.
  • Weighting theater. Setting criteria weights after seeing the results, or quietly reweighting to justify a favored vendor. The weights have to be locked before the run and changed only with a documented reason.
  • Reverting to vibes anyway. Running the whole rigorous process and then letting a senior stakeholder’s reaction to one bad call override it. This is failure mode 2 from earlier, and the antidote is the same: name naturalness as a weighted criterion up front so taste competes fairly instead of vetoing late.
  • Moving too slowly. A bake-off that takes a quarter loses the window. The cautionary case is the team that could not run two evaluations in parallel, dragged the process across months, and lost the deployment slot entirely. Speed is a feature of the method: parallel runs and automated scoring are what keep a rigorous bake-off from collapsing under its own weight.
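
One lightweight guard against weighting theater is to record a fingerprint of the weights before the run and recompute it at decision time. The sketch below is one possible approach using only the Python standard library; the weights and the fingerprint helper are hypothetical.

```python
import hashlib
import json


def fingerprint(weights):
    """Stable SHA-256 digest of a weights dict, independent of key order."""
    canonical = json.dumps(weights, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Recorded before the run (hypothetical weights).
locked = fingerprint({"accuracy": 0.4, "latency": 0.3, "naturalness": 0.3})

# Recomputed at decision time.
same = fingerprint({"latency": 0.3, "naturalness": 0.3, "accuracy": 0.4}) == locked
changed = fingerprint({"accuracy": 0.3, "latency": 0.3, "naturalness": 0.4}) == locked
print(same, changed)  # True False
```

A mismatch does not forbid a change. It forces the change to be visible, which is where the documented reason comes in.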

What happens after you select a vendor

The most underrated benefit of running a rigorous bake-off only shows up after you sign: the work does not get thrown away. The same metrics and test sets you built to choose a vendor are the ones you want pointed at that vendor in production. The moment the chosen agent goes live, the selection test suite becomes your regression suite and your production-monitoring suite. Every metric you wrote to score the bake-off keeps scoring real calls. Every persona and scenario you mined becomes a regression case you re-run before each release, so a fix for one caller does not quietly break another.

This is where an independent evaluation platform pays off twice. Because the platform is the neutral layer and not the vendor, the evaluation is vendor- and model-agnostic: if you swap models later, or change vendors entirely, the same test suite and the same metrics still apply, so you are not rebuilding your quality bar from zero each time. And because the platform does monitoring and human review alongside simulation, the bake-off rolls straight into production observability: Coval keeps running your metrics on live conversations, flags regressions against the same bar you set during selection, and routes the calls that need a person to a human-review queue. The selection exercise and the ongoing quality program are the same artifact, which is the opposite of the typical pattern where a buyer builds an elaborate evaluation, picks a vendor, and then starts over from scratch on monitoring. The three-layer testing framework covers how the same scenarios serve selection and then regression, adversarial, and production-derived testing once you are live.
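
Flagging regressions against the bar set during selection can be sketched as a comparison of pass rates. The metric names, the rates, and the tolerance below are hypothetical, and a production system would also account for sample size before raising an alert.

```python
def find_regressions(baseline, live, tolerance=0.02):
    """Return metrics whose live pass rate fell more than `tolerance` below baseline.

    A metric missing from `live` is treated as 0.0 and therefore flagged.
    """
    return sorted(
        metric
        for metric, base_rate in baseline.items()
        if live.get(metric, 0.0) < base_rate - tolerance
    )


# Hypothetical pass rates: the bake-off baseline versus last week's live calls.
baseline = {"order_accuracy": 0.95, "readback": 0.98, "task_completion": 0.90}
live = {"order_accuracy": 0.96, "readback": 0.91, "task_completion": 0.89}

regressions = find_regressions(baseline, live)
print(regressions)  # ['readback']
```

Task completion dipped by one point, which is inside the tolerance, while readback fell seven points and is flagged. The baseline dictionary is the bake-off scorecard reused as a monitoring threshold.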

Where to go from here

A defensible voice agent vendor bake-off is not complicated, but it is real work: name the shape, weight the criteria before you listen to a pitch, mine your calls into a test suite, turn requirements into metrics, run every vendor identically (in rounds if the field is wide), score honestly, and decide against the weights you set first. Done this way, the decision survives a skeptical room because every step removed a way it could be dismissed, and the test suite you built keeps working for you after you sign.

The part most teams underestimate is the volume. Mining thousands of calls and grading thousands of conversations across vendors is exactly the work a single reviewer cannot do by hand and exactly the work an evaluation platform exists to do. Coval is the neutral measurement layer that sits between the buyer and the vendors: it ingests the calls, builds the test suite, runs every vendor through identical simulations, grades them with consistent metrics, and routes the calls that need a person to a human review queue. It is the instrument that makes the bake-off objective, not one of the contestants.

If you want the methodology behind the metrics, read the voice AI agent evaluation guide. If you are deciding whether to build your own evaluation harness or buy one for the bake-off, the build vs. buy framework covers that tradeoff. To understand why the demo and the deployment diverge, see why voice AI agents break in production and the five metrics that predict production success. For the persona-realism piece, accents, dialects, and multilingual testing covers the blind spot most test suites miss, and the three-layer testing framework shows how a bake-off test suite becomes ongoing regression coverage after you sign.

Run your next bake-off with rigor. Get the playbook PDF and the Claude Code skill to do it yourself, or have a Coval solutions engineer run it with you.

Frequently asked questions

How long should a voice agent vendor bake-off take?
A focused bake-off runs in two to six weeks, not a quarter. The long pole is building the test suite, and mining production calls compresses that because you are categorizing real conversations instead of inventing scenarios. A multi-round bake-off that culls a wide field before a deep finalist round lands toward the higher end of that range; a tight head-to-head lands toward the lower end. Running the vendors is fast when the runs happen in parallel and the scoring is automated. The teams that drag a bake-off across months almost always do it serially, one vendor and one manual review at a time, and lose the deployment window. Treat parallelism and automated grading as part of the method, not a nice-to-have.
How many production calls do I actually need to mine?
Enough to be representative of your real traffic distribution, which usually lands somewhere between 1,000 and 10,000 calls depending on how varied your use cases are. A narrow, structured agent (appointment confirmations) needs fewer; a broad customer-service line needs more. The goal is coverage, not volume for its own sake: every meaningful use case and caller type should appear often enough to weight. If you have no calls yet, synthesize the test suite from domain expertise and the common caller archetypes, then backfill from production the moment you are live.
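One way to check "often enough to weight" is to count the mined calls per category and flag anything too thin to trust. The sketch below assumes each call has already been labeled with a use case. The `min_count` threshold of 30 is an illustrative assumption, not a figure from this guide.

```python
from collections import Counter


def coverage_report(labels, min_count=30):
    """Return per-category count, traffic weight, and a coverage flag.

    labels: one use-case label per mined call.
    min_count: hypothetical minimum number of examples needed to weight a category.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        category: {
            "count": n,
            "weight": n / total,       # share of real traffic
            "enough": n >= min_count,  # False means mine more calls
        }
        for category, n in counts.items()
    }
```

Categories flagged `enough: False` are the signal to mine more calls or to synthesize scenarios for that use case until production data fills the gap.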
Should I let a vendor see the test set before the run?
Share the criteria and the categories, withhold the specific scenarios. Vendors should know you will grade order accuracy, escalation handling, and total-confirmation readback, because that lets them set up a fair like-for-like configuration. What they should not see is the exact test suite, because an agent tuned to a known set of cases scores high on those cases and tells you nothing about the rest. A useful middle ground: give every vendor a small public sample for setup and integration, and hold the scored test suite back until the run. The vendor learns how they did afterward, never what the test contained before it.
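The public-sample-versus-held-back split can be made mechanical so nobody hand-picks which scenarios vendors see. This is a minimal sketch assuming scenarios are identified by string IDs. The function name and the seeded shuffle are illustrative choices, not a prescribed method.

```python
import random


def split_suite(scenario_ids, sample_size, seed=0):
    """Split scenarios into a small public sample and a held-out scored set.

    The split is deterministic for a given seed and ignores input order,
    so it can be reproduced later if a vendor questions the process.
    """
    if sample_size > len(scenario_ids):
        raise ValueError("sample_size exceeds the number of scenarios")
    shuffled = sorted(scenario_ids)        # normalize order before shuffling
    random.Random(seed).shuffle(shuffled)  # seeded, so repeatable
    public = shuffled[:sample_size]
    held_out = shuffled[sample_size:]
    return public, held_out
```

Because the two lists never overlap, nothing a vendor tunes against during setup appears in the scored run.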
What if no vendor passes the bake-off?
That is a legitimate and useful outcome, not a failed evaluation. It means your requirements are ahead of what the market reliably delivers for your use case. From there you have three options: revise the requirements with eyes open about the tradeoff, run a limited production pilot with the closest two vendors to see if real traffic changes the picture, or have a build-vs-buy conversation. A forced signature on a vendor that did not clear your bar is the most expensive outcome of all.
How is a bake-off different from a proof of concept?
A POC asks "can this vendor do the thing at all?" against that one vendor's setup. A bake-off asks "which vendor does the thing best for my callers?" against an identical, shared test suite run across every contender. A POC can be useful as a pre-filter to decide who makes the finalist set, but it cannot pick a winner, because each vendor's POC runs under its own conditions. The bake-off's defining feature is that the conditions are held identical so the scores are comparable.
Can I reuse the bake-off test suite after I sign?
Yes, and reuse is the point, not a bonus. As the "what happens after you select a vendor" section above lays out, the selection test suite becomes your regression and production-monitoring suite the day the vendor goes live. The new mechanic post-signing is that the suite grows: every production failure you catch in monitoring gets turned into a new test case and added to the suite, so it becomes your institutional memory of every failure mode you have seen, and the agent gets re-checked against all of them before each release. A failure you ship once should never ship twice. Because the suite is vendor- and model-agnostic, it also survives a model swap or a vendor change without a rebuild. This is the coverage the three-layer testing framework describes, now seeded from the selection work you already did.
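The growing-suite mechanic is simple enough to sketch. The version below assumes the suite is a dictionary keyed by a failure ID, and that `run_case` is a hypothetical callable returning True when the agent handles a scenario correctly. Neither name comes from a real tool.

```python
def add_failure(suite, case_id, scenario):
    """Record a production failure as a regression case (idempotent per case_id)."""
    suite.setdefault(case_id, scenario)


def release_blockers(suite, run_case):
    """Return the IDs of every recorded failure the candidate agent still fails.

    run_case(scenario) is a hypothetical callable: True means the agent passed.
    """
    return sorted(case_id for case_id, scenario in suite.items()
                  if not run_case(scenario))
```

An empty list from `release_blockers` means every failure seen so far has been re-checked and passes. Since `run_case` is the only vendor-specific piece, swapping the model or the vendor means swapping that one callable while the suite stays intact.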
Who should own the bake-off internally?
One owner with the authority to lock the criteria and weights, usually the head of product or the eval lead, with engineering supplying the integration work and a senior stakeholder reviewing the weighted result. The failure mode to avoid is a committee that reweights criteria after seeing scores, or a senior leader who sits out the criteria-setting and then vetoes the data on a single call. Decide who holds the pen on the weights before the run, and make the senior preference one weighted input rather than a late override.
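Locking the weights before the run can be enforced in code as well as in process. In this sketch the criteria names, including `stakeholder_preference`, and the example numbers are hypothetical. The point is that the weights become read-only once set, and the senior preference enters as one weighted criterion.

```python
from types import MappingProxyType


def lock_weights(weights):
    """Validate that weights sum to 1 and return a read-only view of them."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return MappingProxyType(dict(weights))  # copy, then freeze


def weighted_score(weights, scores):
    """Combine per-criterion scores (0 to 1) using the locked weights."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    return sum(weights[c] * scores[c] for c in weights)


# Hypothetical criteria and weights, set before any vendor is scored.
WEIGHTS = lock_weights({
    "order_accuracy": 0.5,
    "escalation_handling": 0.3,
    "stakeholder_preference": 0.2,
})
```

Any attempt to reassign a weight after locking raises a `TypeError`, so reweighting after the scores are in requires a visible, deliberate change rather than a quiet edit.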
