Manual QA Doesn't Scale for Voice AI. Start There Anyway.
Henry Finkelstein, Founding Growth Engineer
Last updated: April 2026
Reading time: 8 min
Key Takeaways
Manual QA is the right starting point for voice AI testing. Everyone starts here, and you should too.
There are seven clear signals that you've outgrown it, from call volume to language coverage gaps.
Engineers feel the pain months before leadership does. That's normal.
When it's time to graduate, frame the case in each stakeholder's language: product, engineering, QA, compliance, and executive leadership each care about different things.
Outgrowing manual QA is a sign of success — your agent is getting more capable, which is exactly the goal.
Manual QA for voice AI is the practice of testing voice agents by having humans call the agent, listen to recordings, and read transcripts to assess quality by hand.
We build evaluation infrastructure for voice AI. And we're here to tell you: start with manual QA.
This might sound like a roofing company telling you to grab a tarp. But hear us out. Manual QA is the single most common starting point we see across every team we talk to. Healthcare, fintech, insurance, recruiting, quick-service restaurants. Everyone starts by picking up the phone, calling their own agent, and listening. Far from a red flag, it's a sign your agent is real, handling real calls, and growing in complexity.
The problem isn't that you started with manual QA. The problem is staying there after you've outgrown it.
When You Have Twenty Calls a Day, Listen to All of Them
At low volume, manual QA is the right call. Not a stopgap. Not a "we'll automate this later" compromise. The actual, correct approach.
When your agent handles a dozen calls a day, you can and should listen to every one. Read every transcript. You'll learn things no automated system can teach you yet:
Failure modes you'd never think to test for. A caller who trails off mid-sentence. Background noise that confuses the transcriber. An accent that shifts intent classification in a direction nobody predicted.
What "good" actually sounds like. Before you can write an eval rubric, you need a gut sense of quality. That comes from hours of listening, not from a specification document. (Hamel Husain calls this "going beyond vibe-checks to data-driven voice AI QA" — manual review is where the data starts.)
How wrong your assumptions are. The calls you imagined during development are not the calls you get in production. Not even close.
We recently talked to a Python engineer at an AI recruiting startup who had five people doing manual QA: "five people testing different machines, different scenarios, different scripts." At 20 calls per customer per month, it worked fine.
A director at one of the largest public-sector software companies in the US described their process as "literally developing a list of questions that we want to get answers to and then grading the quality of those scores by hand." This is a Fortune 500 company. You're in good company.
Here's a useful heuristic: an engineer making ~$120K/year costs roughly $1 per minute in direct salary. A 15-minute manual test call is $15 in direct labor alone. That's before you count the opportunity cost of that engineer not shipping product, or the morale hit of mind-numbing repetitive testing day after day. At low volume, $15 a call is a bargain for what you learn. At high volume, it's a budget line item that nobody approved.
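If you want to pressure-test that heuristic against your own numbers, a back-of-the-envelope sketch like the one below is enough. Every input here is an assumption — the salary, the hours per year, the call length, the volumes — so swap in your own figures.

```python
# Back-of-the-envelope cost of manual QA test calls.
# All inputs are assumptions -- replace them with your own numbers.

ANNUAL_SALARY_USD = 120_000       # direct salary only; fully loaded cost is higher
WORK_HOURS_PER_YEAR = 2_000       # ~50 weeks x 40 hours
COST_PER_MINUTE = ANNUAL_SALARY_USD / WORK_HOURS_PER_YEAR / 60   # ~$1.00/minute

def manual_qa_cost(calls_per_month: int, minutes_per_call: float = 15.0) -> float:
    """Direct labor cost of manually testing this many calls per month."""
    return calls_per_month * minutes_per_call * COST_PER_MINUTE

if __name__ == "__main__":
    # A dozen a day, a few hundred a week, and the 6,000/month ask from the article.
    for volume in (20, 1_000, 6_000):
        print(f"{volume:>6} calls/month -> ${manual_qa_cost(volume):>9,.0f}/month in direct labor")
```

Run it with your own team's volume and the "budget line item nobody approved" tends to show up on the first line of output.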
Seven Signs You've Outgrown Manual QA
Read these as graduation signals, not failures. Each one means your agent is getting more capable, your user base is growing, or your team is shipping faster. All good things. They just mean manual QA can't keep up anymore.
1. Volume outpaces your team. That same recruiting startup? A client asked for 6,000 calls a month. The five-person QA team went from adequate to "unrealistic" overnight. A voice AI architect at a healthcare platform described the same cliff: they were running 200-300 test calls per sprint. Their target was 10,000-15,000 calls per month. Manual couldn't bridge that gap.
2. You're shipping faster than you can test. Every prompt tweak means another round of manual calls. The iteration cycle that used to take a day now takes a week. Your velocity is gated by how fast humans can pick up phones, not by how fast engineers can write code.
3. Things break in production that passed in testing. Your clean-room test calls don't cover what production actually sounds like. Real callers are on speakerphone in moving cars. They have accents. They're frustrated. They say "uh, wait, actually, no, the other thing." Your five testers in a quiet office aren't simulating any of that. (We wrote about this demo-to-production gap in depth in our voice AI agent evaluation guide.)
4. You're expanding functionality faster than coverage. New intents, new tool integrations, new backend connections, new conversation flows. Each one adds surface area that needs testing. Your test matrix is growing geometrically while your QA team is growing linearly (or not at all).
5. Your team doesn't speak all the languages you support. Ten languages supported, three native speakers on the QA team. Testing the other seven is guesswork. You know there are problems in Portuguese and Hindi. You have no way to quantify them. This is one area where automated simulation with diverse personas can cover ground that no human team realistically can.
6. The math stops working. A founder running 200,000 automated calls per month put it plainly: "Even if you're 99%, you're still getting 2,000 bad calls." At that scale, manually reviewing just the failures is a full-time job. Reviewing the successes for false positives is impossible.
7. You're testing instead of building. A senior ML engineer at a major insurance company described building evaluation pipelines from scratch as something that "takes away from building the actual service." Your best engineers are spending their weeks on QA calls instead of shipping the features that would actually move the product forward. That opportunity cost is invisible to leadership but it's real to the team.
The image that sticks with us: a voice AI architect at a healthcare platform describing their QA process as "a bunch of humans picking up the phone and trying hard to make it sound stupid." If that sentence makes you wince and laugh at the same time, you've probably outgrown manual QA.
Engineers Feel It First
Here's the pattern we see over and over. The engineer running the voice program knows manual QA is broken months before anyone else on the team feels it.
That recruiting startup engineer said it directly: "Besides me, Product, and maybe QA, nobody feels those pains yet." No senior executive champion. No burning platform at the leadership level. Just the developer on the ground who can see the cracks forming.
This is normal. The people closest to the system feel the strain first. Leadership is looking at top-line metrics that still look OK. Calls are completing. Customers aren't churning (yet). The agent handles the happy paths fine.
But you're the one white-knuckling every deploy. You're the one who knows that the 15-minute test call you ran before the last release didn't cover the edge case that broke in production on Tuesday. You're the one who can feel the gap between what you're testing and what your users are experiencing.
You're not being dramatic. You're pattern-matching correctly. And recognizing this is the first step toward doing something about it.
Making the Case Internally
You see the problem. Now you need everyone else to see it. The trick is framing the pain in each stakeholder's language, not yours. (For a deeper dive on why most teams skip evaluation infrastructure and how to make the case internally, we wrote a companion piece on exactly this.)
To Product: "Our QA coverage is falling behind our release velocity. We're shipping conversation flow changes we can't fully test, and user-facing quality gaps are showing up in [specific metric: CSAT, task completion rate, repeat caller rate]. We're flying blind on the user experience impact of every deploy."
To Engineering Leadership: "We're spending X engineer-hours per week on manual test calls. At $1 per minute, that's $Y per month in direct salary costs. Those are senior engineers not building product. Our iteration cycle has gone from one day to one week because every change requires a manual testing round."
To QA: "Our scenario coverage is at Z% and dropping every time we add a new intent or language. We're at [number] supported languages and [number] native speakers. Defects are escaping to production that we would have caught six months ago when the agent was simpler."
To Compliance: "We can't prove our agent handles [HIPAA disclosures / TCPA consent / mini-Miranda language / state-specific requirements] correctly across all scenarios. We test what we can reach manually, but we don't have coverage for [specific languages, specific edge cases, specific caller conditions]."
To the C-suite: "Our voice AI program is growing. That's the good news. The risk is that our QA process doesn't scale with it. We're one bad production week away from a customer-facing incident we can't catch manually. A proper evaluation platform costs a fraction of what one incident costs in engineering time, customer remediation, and reputational damage."
The $1-per-minute heuristic is your anchor in every conversation. It's concrete, easy to multiply by your team's hours, and hard to argue with. Pair it with the opportunity cost: "Every hour an engineer spends on manual test calls is an hour they're not shipping the next feature." And name the risk: one production incident with a real customer is more expensive than a year of evaluation tooling. For a framework on managing AI system risks responsibly, see NIST's AI Risk Management Framework.
What Graduating Looks Like
Outgrowing manual QA is a sign things are going well. Your agent is getting more capable. Your user base is growing. The complexity is increasing in the right directions. You didn't fail at QA. You succeeded at growth.
The transition doesn't have to be a cliff. Start by automating the simplest checks: what must your agent always do? What must it never do? Run those on every deploy. Keep manually reviewing the calls that matter most. Over time, shift the balance. The automated coverage expands. The manual review becomes targeted instead of exhaustive. The longer-term destination is a three-layer testing framework — regression, adversarial, and production-derived tests stacked on top of each other — fed by production monitoring that turns every real-world failure into a future test case.
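To make the "always do / never do" idea concrete, here's a minimal sketch of what that first regression floor can look like. It assumes you already have call transcripts as plain text; the rule names, phrases, and helper functions are illustrative placeholders, not any particular library's API.

```python
# Minimal always/never regression checks over call transcripts.
# The rules and phrases below are placeholders -- encode whatever your
# agent must always do (disclosures, verification) or must never do.

from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    phrase: str        # substring to look for (case-insensitive)
    must_appear: bool  # True = "always" rule, False = "never" rule

RULES = [
    Rule("recording_disclosure", "this call may be recorded", must_appear=True),
    Rule("identity_verification", "confirm your date of birth", must_appear=True),
    Rule("respects_stop_request", "i'll continue anyway", must_appear=False),
]

def check_transcript(transcript: str) -> list[str]:
    """Return the names of any rules this transcript violates."""
    text = transcript.lower()
    return [r.name for r in RULES if (r.phrase in text) != r.must_appear]

def run_regression(transcripts: dict[str, str]) -> bool:
    """Require a 100% pass rate across every test transcript."""
    all_passed = True
    for call_id, transcript in transcripts.items():
        failures = check_transcript(transcript)
        if failures:
            all_passed = False
            print(f"FAIL {call_id}: {', '.join(failures)}")
    return all_passed
```

Wire something like this into your deploy pipeline so it runs on every change, and treat any failure as a blocked release. It's crude, but it catches the failures that matter most while your automated coverage grows around it.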
We wrote Voice AI Agent Evaluation: The Complete Guide for exactly this moment. It picks up where manual QA leaves off and walks you through the full maturity curve, from your first automated evals through production monitoring and beyond.
Wherever you are right now, the fact that you're thinking about evaluation quality puts you ahead of most teams. Start by listening. Graduate when the signals tell you to. And when you're ready to talk about what comes next, reach out to the Coval team.
Frequently Asked Questions
When should I stop doing manual QA for voice AI?
You shouldn't fully stop. The signal isn't "stop manual QA," it's "stop relying on manual QA as your only line of defense." Watch for the seven graduation signals: volume outpacing your team, shipping faster than you can test, things breaking in production that passed in testing, expanding functionality faster than coverage, language gaps you can't manually staff, the math no longer working at scale, and engineers spending more time testing than building. When two or three of these are true, it's time to introduce automated evals alongside (not instead of) manual review.
How much does manual QA cost compared to automated evaluation?
The simple heuristic: an engineer at $120K/year costs roughly $1 per minute in direct salary. A 15-minute manual test call costs ~$15 in direct labor before counting the opportunity cost of the engineer not shipping product. At low volume (a dozen calls a day) that's a bargain for the intuition you build. At 250 calls a week or 6,000 a month, manual QA either becomes a full-time team or a bottleneck. Automated evaluation costs scale with simulation minutes and judge tokens — typically a fraction of equivalent human-hours once you're past a few hundred calls per week.
What's the first thing to automate when graduating from manual QA?
Two lists. What must your agent always do? What must it never do? Encode each as a binary metric, run those checks on every deploy, and require 100% pass rate. This is your regression floor. It's not glamorous and it's not comprehensive, but it catches the most expensive failure modes — compliance violations, identity-verification gaps, agents continuing past stop requests — and it runs in minutes instead of hours.
How do I convince leadership that manual QA isn't scaling?
Frame it in each stakeholder's language, not yours. To Engineering Leadership: hours-per-week at $1/minute and the slowdown in iteration velocity. To Product: user-experience gaps you can see but can't catch before deploy. To Compliance: scenarios you can't prove coverage on. To the C-suite: one bad production week costs more than a year of evaluation tooling. The $1-per-minute heuristic is your anchor: concrete, easy to multiply by your own hours, and hard to argue with. Pair it with the opportunity cost — every hour an engineer spends on manual test calls is an hour they're not shipping the next feature.
Can I use both manual and automated QA together?
Yes — and the best teams do. Automated coverage handles regression, scale, and repeatable checks. Manual review handles the calls that matter most: complaints, escalations, edge cases your eval missed, novel failure modes worth turning into future test cases. Over time, the balance shifts. Automated coverage expands as you encode more of your intuition into rubrics. Manual review becomes targeted instead of exhaustive. Both still earn their place.
See how Coval can help you improve your agents.
Book a call
