Financial Services · Digital Banking
Chime had two problems running in parallel. The first was a vendor decision: the team had to choose between two leading AI agent platforms, with no way to compare them beyond demos. The second was harder: their agents were occasionally claiming capabilities they didn't have, in regulated banking conversations where overstating a capability isn't a customer experience issue. It's a compliance one.
Manual QA couldn't keep up. At millions of monthly conversations, sampling a fraction of calls meant most issues were caught only after customers complained. And every vendor update risked introducing regressions in the workflows that mattered most: fraud escalation, identity verification, sensitive banking actions.
The team had to choose between two AI agent platforms based on demos and intuition.
Agents were claiming capabilities they didn't have, caught only after customer feedback.
Every model or prompt change risked breaking fraud and identity verification flows.
Sampling 1–2% of conversations left the rest uncovered.
| Metric | Before Coval | With Coval |
|---|---|---|
| Vendor evaluation | Demos and vendor-supplied data | Quantified across 2,768 calls in under two weeks |
| Compliance coverage | Reactive, caught after customer complaints | Proactive, caught in simulation before launch |
| QA model | 1–2% of calls manually sampled | Every conversation design simulated before launch |
| Regression testing | Limited spot-checks between vendor updates | Every vendor update validated against the full suite |
| AI governance | Distributed, no single source of truth | Coval is the system of record |
Chime didn't start with the goal of building an AI governance program. They started with a vendor decision and a small set of workflows to test. What changed everything was the moment Coval surfaced a compliance issue that nobody on the team had seen coming.
Coval ran both candidate AI agent platforms against the same 2,768 simulated calls, with the same evaluators measuring the same compliance and accuracy criteria. The output wasn't a vendor opinion. It was a comparison the team could defend, with failure modes mapped per vendor and per workflow type.
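To make "failure modes mapped per vendor and per workflow type" concrete, here is a minimal sketch of how such a comparison could be tallied. It is not Coval's implementation or API. The `CallResult` record, the `failure_rates` helper, the vendor and workflow names, and the sample rows are all hypothetical placeholders, not Chime's data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallResult:
    """One simulated call scored by one evaluator (hypothetical schema)."""
    vendor: str
    workflow: str
    passed: bool


def failure_rates(results):
    """Return {(vendor, workflow): fraction of calls that failed}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for r in results:
        key = (r.vendor, r.workflow)
        totals[key] += 1
        if not r.passed:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


# Placeholder rows for illustration only; these are not real results.
sample = [
    CallResult("vendor_a", "fraud_escalation", True),
    CallResult("vendor_a", "fraud_escalation", False),
    CallResult("vendor_b", "fraud_escalation", True),
    CallResult("vendor_b", "fraud_escalation", True),
    CallResult("vendor_a", "identity_verification", True),
    CallResult("vendor_b", "identity_verification", False),
]

rates = failure_rates(sample)
for (vendor, workflow), rate in sorted(rates.items()):
    print(f"{vendor:10} {workflow:24} {rate:.0%}")
```

Because both vendors are scored on the same calls with the same pass/fail criteria, the per-workflow rates are directly comparable, which is what separates this kind of comparison from a demo.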
Months into pre-production testing, Coval surfaced cases where the agent was overstating what it could do in regulated banking workflows. These were exactly the kinds of issues that manual QA might spot once or twice but could not catch at scale. Catching them in simulation reframed Coval inside Chime from a testing tool to a compliance one.
Coval is now where every conversation design, vendor update, and compliance check goes before launch. The team moved from reactive triage to proactive validation, with a regression suite that grows every time a new failure mode is found.
For Chime, the shift was bigger than any single metric. The team stopped finding out about agent failures from customer feedback and started finding them in simulation, before launch. The compliance team stopped reviewing a sample and started signing off on a process. Vendor updates stopped being a quarterly risk and became a routine release.
"Coval moved AI governance at Chime from sampling to certainty. Every conversation design, every vendor update, every compliance check runs through simulation before it touches a member."
What started as a vendor evaluation became the system of record for AI governance at Chime, across voice, across chat, across millions of monthly conversations.