How to Evaluate AI Voice Agents: A Practical Buyer's Framework
Most AI voice agent evaluations measure the wrong things. Buyers spend three weeks watching scripted demos, comparing vendor benchmark tables, and counting features on G2 — then deploy and discover that the agent that won the bake-off can’t handle a real caller asking three things in one sentence.
We’ve watched this play out across fiber ISPs, legal services, and US contact centers, and the pattern is consistent. This is a practical framework for evaluating AI voice agents in 2026 — what to test, what to ignore, and the criteria that actually predict whether a vendor’s agent will survive contact with your callers.
What an AI voice agent actually is — and what it isn’t
The term gets used loosely. Some vendors call any voice interface an “AI voice agent” — including IVR menus with a slightly better speech recognizer bolted on. Others reserve the term for systems that can hold multi-turn conversations, make decisions, and execute actions. The evaluation framework starts with getting precise about what you’re actually buying.
An AI voice agent, as we use the term, has four layers:
Telephony handles the call: SIP trunking, codec negotiation, media routing — the boring infrastructure that decides whether the audio arrives clean.
Automatic speech recognition (ASR) converts the caller’s audio into text in near real time.
A large language model orchestrates the conversation, reading the text, deciding what to say, and making tool calls to look up account data or update a ticket.
Text-to-speech (TTS) renders the response back to audio. Loop, repeat, until the call ends.
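As a rough illustration of how the four layers hand off to each other on every turn, here is a minimal sketch. The `asr`, `llm`, and `tts` callables and the `fake_*` stand-ins are hypothetical placeholders, not any vendor's API; a real agent would stream audio and handle interruptions rather than process whole chunks.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical stand-ins for vendor components (assumptions, not a real API).
History = List[Tuple[str, str]]          # (caller_text, agent_text) per turn
Asr = Callable[[bytes], str]
Llm = Callable[[History, str], str]
Tts = Callable[[str], bytes]


def run_call(caller_audio: Iterable[bytes], asr: Asr, llm: Llm, tts: Tts) -> List[bytes]:
    """Process a call turn by turn and return the audio the agent played."""
    history: History = []
    played: List[bytes] = []
    for chunk in caller_audio:                   # telephony delivers caller audio
        caller_text = asr(chunk)                 # ASR: audio -> text
        agent_text = llm(history, caller_text)   # LLM: decide what to say, with context
        played.append(tts(agent_text))           # TTS: text -> audio
        history.append((caller_text, agent_text))
    return played


# Toy implementations so the sketch runs end to end.
def fake_asr(chunk: bytes) -> str:
    return chunk.decode("utf-8")


def fake_llm(history: History, caller_text: str) -> str:
    return f"Turn {len(history) + 1}: you said '{caller_text}'"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


audio_out = run_call([b"billing", b"also my router"], fake_asr, fake_llm, fake_tts)
```

The `history` list is the part an IVR lacks: each LLM call sees every earlier turn, which is what lets the agent hold context.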
What separates an AI voice agent from an IVR is the orchestration layer. An IVR follows a decision tree. The caller says “billing,” it routes to billing. The caller says something unexpected, it falls back to “please say one of the following options” or transfers to a human. An AI voice agent reasons. It holds context across turns. It can handle a caller who interrupts themselves, changes the subject, or asks two questions at once.
What separates a voice agent from a chatbot is voice as the medium. That sounds obvious, but the implications aren’t. Voice has no scrollback. Voice forces the agent to handle interruptions, overlapping speech, and silence. Voice requires sub-second response or the caller hangs up. The same LLM that powers a chatbot needs a completely different surrounding architecture to work over a phone line.
What separates it from older voicebots — the rule-based, hand-scripted speech systems contact centers deployed in the 2010s — is that the conversation is generated, not pre-written. The agent doesn’t follow a script. It works from a goal, a set of tools, and the context of the call so far.
This is the part most vendor demos do well and most production deployments do poorly. Generating coherent speech in a quiet conference room with a friendly tester is one job. Generating coherent speech when the caller’s three-year-old is screaming and the agent has to verify identity, look up an account, and answer a billing question is another job entirely.
This distinction matters for evaluation because the four layers fail in different ways. Most vendor demos are designed to make all four layers look great at once. Production traffic exposes the layer that’s weakest.
Where AI voice agents work, where they fail
Voice agents earn their keep on narrow, high-frequency, repeatable call types where the answer can be sourced from data the agent can read. They struggle — sometimes badly — on calls that require judgment, escalation, or empathy.
What works in 2026:
Inbound deflection of L1 support: order status, account balance, appointment rescheduling, service status checks, password resets.
Outbound transactional: appointment confirmations, payment reminders, surveys, service notifications.
Intake and qualification: reception, lead capture, initial information-gathering for professional services.
Switchboard duty: reception, after-hours coverage, language routing.
The common thread: the agent’s job is to figure out what’s needed, source the answer, and either resolve it or route it.
What still doesn’t work cleanly:
Calls that require judgment under uncertainty, such as “my internet has been slow for two weeks and I think it’s the router but my neighbor said it’s the area,” where the agent either over-promises a fix it can’t deliver or escalates immediately and adds latency.
Emotional calls (cancellations, complaints, bereavement intake), where agents can take the information cleanly but customer experience suffers without a human in the loop.
Calls requiring multi-system reasoning, where the answer requires pulling data from three systems, reconciling inconsistencies, and explaining a decision. Error rates climb fast.
We’ve deployed six AI voice agents at a US fiber operator in Florida — three live on inbound, integrated into a NICE CXone environment using a two-number SIP architecture to work around standard limitations. We’ve shipped conversation intelligence at a UK fiber ISP serving 150K residents in student housing, where AI handlers and human engineers work side by side on tier-1 support. We’re scoping AI intake for a UK housing solicitor handling roughly 2,000 inbound calls a day across homelessness and housing disrepair cases.
The pattern across all three: the workloads that move are narrow, high-frequency, and answerable from data the agent can actually see. Everything else still belongs to humans.
This shapes the evaluation. If you’re shopping a voice agent for a use case that requires judgment, your evaluation criteria are different than if you’re shopping for high-volume L1 deflection. The framework below assumes the second case — because that’s where most of the production value lives in 2026, and that’s where the buying mistakes are most expensive.
Why vendor demos are a poor predictor of production
Demo theater is a problem. Every voice agent vendor has a demo. The demo is curated. The demo is rehearsed. The demo uses scenarios the vendor knows their agent handles well. Three specific problems with the demo as an evaluation tool:
Selection bias in the test scenarios. Vendors pick scenarios that showcase their agent’s strengths. They don’t pick the scenario where a caller speaks over the agent, says “uh, hang on, my kid is screaming,” and asks a follow-up question that contradicts what they said 90 seconds earlier. That call exists in your production traffic. It doesn’t exist in the demo.
The audio quality cheat. Demo audio is clean. The agent is talking to someone on a high-quality headset in a quiet room. Your callers are on speakerphone in a moving car, in a kitchen with the dishwasher running, on a phone with bad coverage. ASR accuracy drops materially when the input audio moves from studio-grade to typical mobile-call-grade. Demos rarely surface this.
The latency lie. Vendors quote median latency. Median latency is fine. Production has tails. A voice agent with a median response time of 600ms but a p99 of 4 seconds will sound broken on 1 in 100 turns. On a 20-turn call, that means roughly one call in five includes at least one broken turn. The number that matters is p95 or p99 latency, under realistic load, on the LLM tier you’ll actually run in production.
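To make the tail arithmetic concrete, here is a minimal sketch that computes nearest-rank percentiles from a set of turn latencies and estimates how many calls contain at least one slow turn. The sample latencies and the 20-turns-per-call figure are illustrative assumptions, and the per-call estimate assumes slow turns occur independently of each other.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p_call_has_slow_turn(p_slow_turn: float, turns_per_call: int) -> float:
    """Probability a call contains at least one slow turn (turns assumed independent)."""
    return 1 - (1 - p_slow_turn) ** turns_per_call


# Hypothetical sample: 98 fast turns and 2 slow ones, in milliseconds.
latencies_ms = [600] * 98 + [4000] * 2

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

# If 1 turn in 100 is slow and a call has 20 turns (assumed), how many calls are affected?
share_of_calls = p_call_has_slow_turn(0.01, 20)
```

With these made-up numbers the median and the p95 are both 600 ms while the p99 is 4,000 ms, and about 18% of 20-turn calls contain at least one slow turn. That is why a median alone says little about what callers experience.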
Demos are useful for ruling vendors out. They are not useful for ruling them in. The evaluation has to test the agent on conditions the vendor didn’t choose.
The criteria that actually predict success
Four dimensions predict whether a voice agent will work in your production environment. Most vendor scorecards cover one or two and skip the rest.
Conversational quality and recovery
Conversational quality isn’t “does the agent sound natural.” It’s “what does the agent do when the conversation goes off script.”
Mid-turn interruption — caller cuts the agent off. Does it stop talking immediately, or finish its sentence?
Topic shift — caller starts asking about one thing and mid-sentence pivots to another. Does the agent follow?
Ambiguous request — caller says something that could mean two things. Does the agent ask for clarification or guess?
Out-of-scope handling — caller asks about something the agent isn’t trained for. Does it fail gracefully and offer transfer, or hallucinate?
Most agents do fine when the conversation goes the way they expect. The agents worth buying do fine when it doesn’t.
Latency, accuracy, and the technical fundamentals
End-to-end latency — the time from when the caller stops talking to when the agent starts responding — is the single hardest thing to fake in production. The thresholds that matter:
Caller doesn't notice the delay.
Caller wonders if there's a delay.
Caller repeats themselves or hangs up.
ASR accuracy matters but usually isn’t the bottleneck — modern ASR is reasonably good in 2026. Where it falls apart is on accents, low-bandwidth audio, and overlapping speakers. Test with audio that reflects your actual caller base, not the vendor’s training set.
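One common way to quantify ASR accuracy on your own recordings is word error rate (WER): the word-level edit distance between a human reference transcript and the ASR output, divided by the number of reference words. The sketch below is a minimal implementation under that definition; the example transcripts are invented, and production scoring would usually add text normalization (numbers, punctuation, contractions).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the ASR output
            insertion = curr[j - 1] + 1  # extra word in the ASR output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: a human transcript versus a noisy ASR result.
reference = "my internet has been slow for two weeks"
hypothesis = "my internet has been so for two week"
wer = word_error_rate(reference, hypothesis)
```

Here two of the eight reference words are wrong, so the WER is 0.25. Comparing WER across vendors only means something when every vendor is scored on the same recordings from your own callers.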
TTS quality matters less than latency. A slightly robotic voice that responds in 600ms beats a perfectly natural voice that responds in 2 seconds. Buyers obsess over TTS naturalness because it’s the part of the experience that’s most aesthetic. Operationally, it’s not the bottleneck.
Integration depth
Integration depth is where most pilots die. An agent can have great conversational quality, ship to production, and still fail because it can’t actually update the ticketing system, write back to the CRM, or pull live data from the telephony platform.
Test for:
Two-way integration: can the agent both read from and write to your systems of record?
Real-time data access: can it pull live account state, or only cached snapshots?
Telephony integration: does it work natively with your contact center platform (NICE, Genesys, Talkdesk, Zendesk, Amazon Connect) or require workarounds?
Authentication flows: can it verify a caller’s identity with the same checks your human agents use?
Vendors will claim integration with everything. The test is how deep the integration goes. “We can read order status from Shopify” is shallow. “We can read order status, update the order if the customer requests a change, write a note to the customer record, and trigger a webhook to your fulfillment system” is deep.
Reliability and observability
When the agent goes wrong, can you tell? Can you tell fast? Can you intervene?
Real-time monitoring — can you see the agent’s live conversation state, or only after-the-fact transcripts?
Failure modes — what happens when the LLM provider has an outage? Does the call drop, queue, or fall through to a human?
Versioning — when you update the agent’s prompt or tools, can you roll back if it breaks?
Conversation intelligence — can you analyze 100% of agent conversations after the fact, or only sample?
Reliability is invisible until something fails. By the time you’re noticing reliability problems in production, you’re explaining outages to your CEO. Ask the vendor for their last 12 months of incidents and how each was handled.
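One way to picture the “fall through to a human” answer to the failure-modes question is a reply call wrapped in a timeout. The sketch below is an assumption about how such a guard might look, not any vendor’s implementation; `respond_or_escalate`, the 1.5-second default, and the stub reply functions are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def respond_or_escalate(generate_reply, caller_text, timeout_s=1.5):
    """Return ("say", reply) on success, or ("transfer", reason) if the LLM is slow or down."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(generate_reply, caller_text)
    try:
        return ("say", future.result(timeout=timeout_s))
    except FutureTimeout:
        return ("transfer", "llm_timeout")   # provider too slow: hand off to a human
    except Exception:
        return ("transfer", "llm_error")     # provider failed: hand off to a human
    finally:
        pool.shutdown(wait=False)            # do not block the call on a hung request


# Hypothetical stubs simulating a healthy, a hung, and a failing LLM provider.
def healthy(text):
    return "Your appointment is confirmed."


def hung(text):
    time.sleep(0.3)
    return "too late"


def broken(text):
    raise RuntimeError("provider outage")


ok = respond_or_escalate(healthy, "confirm my appointment")
slow = respond_or_escalate(hung, "confirm my appointment", timeout_s=0.05)
down = respond_or_escalate(broken, "confirm my appointment")
```

The point of asking the vendor is to learn which of these branches they actually implement, and whether the transfer reason is logged so you can see it in monitoring.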
The metrics most buyers track wrong
The dominant metric in voice agent evaluations is containment rate — the percentage of calls the agent handles without transferring to a human. Vendors lead with it. Buyers ask about it. RFPs require it. And by itself, it’s misleading.
Containment without CSAT is a vanity metric. An agent that contains 80% of calls but tanks customer satisfaction is a net loss — you’ve shifted cost from labor to churn, and churn is usually more expensive. The pairing matters: containment goes up, CSAT must hold or improve. If CSAT drops more than 2–3 points, the deployment is failing whatever the containment number says.
The metrics that actually matter:
Containment rate paired with CSAT: track both; if they move in opposite directions, something is wrong.
Successful task completion: did the caller actually achieve what they called for? A call can end with a transfer because the agent correctly identified that it needed help (success), or end without a transfer because the caller gave up (failure). A containment column scores the first as a miss and the second as a win.
Average handle time for completed calls: a faster average doesn’t mean a better agent if the calls being resolved fast are the easy ones the human team also could have handled fast.
Cost per resolved call: total operating cost divided by calls actually resolved. This is what your CFO will ask about. It’s the metric vendor pricing pages obscure most aggressively.
Escalation rate by reason: “agent didn’t understand” is a problem; “caller asked for human” is a different problem; “outside agent scope” might not be a problem at all.
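The metrics above can be computed side by side from per-call records. This is a minimal sketch with a hypothetical `CallRecord` shape and invented sample data; it assumes “resolved” means the caller’s task was completed, whether or not a human was involved, which is a definition you would need to agree with the vendor.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CallRecord:
    transferred: bool                        # did the call go to a human?
    task_completed: bool                     # did the caller get what they called for?
    escalation_reason: Optional[str] = None  # set only when transferred


def summarize(calls: List[CallRecord], total_operating_cost: float) -> dict:
    total = len(calls)
    contained = sum(1 for c in calls if not c.transferred)
    resolved = sum(1 for c in calls if c.task_completed)
    reasons = Counter(c.escalation_reason for c in calls if c.transferred)
    return {
        "containment_rate": contained / total,
        "task_completion_rate": resolved / total,
        "cost_per_resolved_call": total_operating_cost / resolved if resolved else None,
        "escalations_by_reason": dict(reasons),
    }


# Invented sample: two contained calls where the caller simply gave up.
calls = [
    CallRecord(transferred=False, task_completed=True),
    CallRecord(transferred=False, task_completed=True),
    CallRecord(transferred=False, task_completed=False),  # caller gave up
    CallRecord(transferred=False, task_completed=False),  # caller gave up
    CallRecord(transferred=True, task_completed=True, escalation_reason="outside agent scope"),
]
summary = summarize(calls, total_operating_cost=6.0)
```

In this toy data containment is 80% but task completion is only 60%, and cost per resolved call is 2.0 rather than the 1.2 you would get by dividing by all calls. The gap between the two rates is where abandoned calls hide.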
The pricing model question most buyers skip
Most voice agent pricing falls into one of three buckets, and the bucket matters more than the headline number.
Per-minute SaaS. $0.15–$0.50 per minute of conversation, charged on agent talk time. Simple to budget, easy to model. The trap: at high volume, the math gets ugly fast. A use case running 100,000 minutes a month at $0.30/min is $30K a month — and that’s before the vendor adds their orchestration fee, their LLM fee, or their integration fee. Vendors price-anchor on small pilots and pretend the per-minute number scales linearly. It rarely does.
Per-call or per-resolved-call SaaS. $0.50–$3 per call depending on length and complexity. Cleaner unit economics, but vendors lock down the definition of “resolved” to protect their margins. Read the definition carefully — a call that escalates to a human in the last 30 seconds may not count as resolved, even if 95% of the work was done by the agent.
Owned infrastructure. You pay setup and maintenance, not usage. Implementation cost is higher — typically £100K–£250K depending on scope — but the marginal cost of additional volume is near-zero. The agent is yours. The integrations are yours. The data stays in your environment. For high-volume, mission-critical inbound, the total cost of ownership over 24 months is usually a fraction of the SaaS equivalent.
The right pricing model depends on the deployment shape. Outbound campaigns with predictable volume can work on per-call SaaS. High-volume, mission-critical inbound is usually better on owned infrastructure. Mid-volume inbound with variable patterns sits in the middle.
The question buyers should ask early — earlier than they usually do — is “what does this cost me at 2x my current volume?” If the answer is “we’ll re-quote,” the vendor’s pricing isn’t ready for production.
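A quick way to pressure-test the 2x question is to model monthly cost under each pricing shape. The sketch below reuses the 100,000 minutes and $0.30 per minute from the per-minute example; the owned-infrastructure setup cost, the monthly maintenance figure, the 24-month amortization, and the single-currency simplification are placeholder assumptions to be replaced with real quotes.

```python
def per_minute_monthly_cost(minutes: float, rate_per_minute: float,
                            extra_monthly_fees: float = 0.0) -> float:
    """Usage-based SaaS: cost grows with every additional minute."""
    return minutes * rate_per_minute + extra_monthly_fees


def owned_monthly_cost(setup_cost: float, monthly_maintenance: float,
                       amortization_months: int = 24) -> float:
    """Owned infrastructure: setup spread over the period, plus flat maintenance."""
    return setup_cost / amortization_months + monthly_maintenance


current_minutes = 100_000
rate = 0.30  # per minute, from the per-minute example

saas_now = per_minute_monthly_cost(current_minutes, rate)
saas_2x = per_minute_monthly_cost(2 * current_minutes, rate)

# Placeholder figures (assumptions): 240,000 setup and 5,000 per month maintenance.
# Usage volume does not appear in the formula, so the result is the same at 1x and 2x.
owned = owned_monthly_cost(setup_cost=240_000, monthly_maintenance=5_000)
```

Under these placeholder numbers the per-minute bill doubles from about 30,000 to about 60,000 a month while the owned figure stays at 15,000. The real comparison depends entirely on the maintenance and add-on fees you are actually quoted.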
The other question worth asking: where does the data live? Voice agent deployments touch sensitive customer data — call recordings, transcripts, account details, payment information. If the vendor’s architecture means that data leaves your environment, your compliance posture has to flex around theirs. For regulated workloads in healthcare, financial services, or legal, this is often the constraint that decides the buy. We build to SOC 2, HIPAA, ISO 27001, and PCI-DSS standards, and for clients in regulated industries, the deployment usually lands on owned infrastructure for exactly this reason.
How to run a 90-day evaluation that doesn’t waste anyone’s time
Most voice agent evaluations take 4–6 months and produce inconclusive results. The structure below is tighter and produces a decision.
1. Map your top 10 call intents
Pull 90 days of call recordings. Cluster into intent buckets. In most contact centers, roughly 80% of call volume sits in the top 10 intents. These are what the agent will be evaluated on.
2. Pick three intents for the pilot scope
Choose narrow, high-frequency, answerable-from-data intents. Resist the temptation to test the hard cases first — the hard cases rarely survive a 90-day pilot, and including them just kills the pilot.
3. Define success criteria before any vendor sees them
Containment target. CSAT floor. Latency budget. AHT range. If the agent doesn’t hit these, the pilot fails. Write them down. Get sign-off from the operational owner. Don’t let success criteria drift mid-pilot to accommodate a vendor’s underperformance.
4. Shortlist three vendors. Not five. Not seven.
Evaluating more than three in parallel dilutes effort. Before the pilot starts, pre-screen aggressively with a structured RFP that includes the criteria above.
5. Run a structured 30-day pilot
Same call traffic, same integrations, same measurement. Each vendor gets the same scope. No vendor gets to customize the evaluation to their strengths. If a vendor refuses pilot parity, that’s the answer — they know their agent doesn’t survive an honest comparison.
6. Audit 100% of pilot calls, not a sample
Manual sampling is how vendors hide tail-latency problems and edge-case failures. Use a conversation intelligence platform if you have one, or budget for one if you don’t. Sampling is what got the contact center industry into this mess in the first place.
7. Define kill criteria up front
Decide what failure looks like before you start, so politics don’t drag a failing pilot across the finish line. The hardest part of a pilot isn’t running it — it’s killing it when the numbers say to.
8. Decide in week 12
Yes, expand. No, sunset. Maybe means no. Indecision past 90 days is how voice agent evaluations turn into permanent science projects.
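Writing the success and kill criteria down can be as literal as a small, frozen record that every vendor’s pilot result is checked against. The sketch below is one hypothetical way to do that; the field names and every threshold value are invented for illustration and are not recommended targets.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)  # frozen: criteria cannot drift mid-pilot
class SuccessCriteria:
    min_containment: float       # containment target
    min_csat: float              # CSAT floor
    max_p95_latency_ms: float    # latency budget
    min_aht_s: float             # AHT range, lower bound
    max_aht_s: float             # AHT range, upper bound


@dataclass(frozen=True)
class PilotResult:
    vendor: str
    containment: float
    csat: float
    p95_latency_ms: float
    aht_s: float


def failed_criteria(result: PilotResult, criteria: SuccessCriteria) -> List[str]:
    """Return the names of the criteria a vendor missed; an empty list means pass."""
    failures = []
    if result.containment < criteria.min_containment:
        failures.append("containment")
    if result.csat < criteria.min_csat:
        failures.append("csat")
    if result.p95_latency_ms > criteria.max_p95_latency_ms:
        failures.append("latency")
    if not (criteria.min_aht_s <= result.aht_s <= criteria.max_aht_s):
        failures.append("aht")
    return failures


# Invented thresholds and results, for illustration only.
criteria = SuccessCriteria(min_containment=0.60, min_csat=4.2,
                           max_p95_latency_ms=1200, min_aht_s=90, max_aht_s=300)
vendor_a = PilotResult("Vendor A", containment=0.68, csat=4.4, p95_latency_ms=950, aht_s=180)
vendor_b = PilotResult("Vendor B", containment=0.81, csat=3.9, p95_latency_ms=1600, aht_s=150)

a_failures = failed_criteria(vendor_a, criteria)
b_failures = failed_criteria(vendor_b, criteria)
```

In this made-up comparison Vendor B has the higher containment number but misses the CSAT floor and the latency budget, so it fails the pilot while Vendor A passes.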
The 90-day evaluation is faster than most buyers think possible and slower than most vendors will accept. Vendors who push back on a structured pilot are telling you something. Vendors who lean into it are usually the ones whose agents survive the evaluation.
What good looks like at the end of the 90 days: you have either no deployment-ready agent, or one. If it’s one, you know which one. You know what you’d pay for it. You know what production volume looks like at full scale. And you can defend the decision in front of your CFO without hand-waving.
The category is changing fast. The vendor landscape in 2026 looks nothing like it did in 2024, and the gap between vendors who can survive production and vendors whose demos look great has widened.
The evaluation that works is the one that tests for production reality, not demo polish. Build the framework, run the pilot, and trust the numbers over the pitch deck. The buyers who do this well in 2026 end up with voice agents handling real call volume at lower cost than their human team could. The buyers who skip the framework end up with a quietly retired pilot and a board deck explaining why the AI strategy is “still maturing.”