Voice Agent Went From 66% to 96% Accuracy. Here's the Exact Optimization Loop I Used.

Most voice agent testing loops look like this: you dial in, order something, decide it felt okay, and ship. I did this for longer than I should have on a drive-thru voice agent I was building called "Future Burger."

The problem with the vibe-check loop isn't just that it's unscientific. It's that it only tests the scenarios you can think of. The edge cases that break your agent in production are the ones you didn't write a test for.

Here's the loop I built to replace it, and the exact results it produced.

The Architecture Decision That Mattered Most

Before touching any optimization, I made one foundational call: the STT and TTS layers are peripherals. They're the ears and mouth, interchangeable.

The LLM is the brain. It handles reasoning, context tracking, mid-conversation state changes, and tool calling. If it can't figure out that "Actually, make that a Sprite" means replacing the drink rather than adding it, no amount of voice synthesis quality saves the interaction.

Every optimization effort went into the intelligence layer.

Step 1: Dataset

I needed a labeled dataset to evaluate against, but I had no real call logs. Rather than wait weeks for production traffic, I used FutureAGI's Dataset to build ground truth from scratch.

The schema had two fields: user_transcript (what the user says) and expected_order (the correct agent output).
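A minimal sketch of what one row in that schema could look like. Only the two field names, user_transcript and expected_order, come from the setup above; the OrderItem structure and the JSONL serialization are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class OrderItem:
    # Hypothetical shape for one line of an order; not part of the original schema.
    name: str
    quantity: int = 1
    modifiers: tuple[str, ...] = ()  # e.g. ("no pickles",)


@dataclass
class Scenario:
    user_transcript: str             # what the user says
    expected_order: list[OrderItem]  # the correct agent output


example = Scenario(
    user_transcript="Cheeseburger no pickles",
    expected_order=[OrderItem(name="cheeseburger", modifiers=("no pickles",))],
)

# One JSON object per line (JSONL) keeps the dataset easy to diff and stream.
line = json.dumps(asdict(example))
print(line)
```

Keeping expected_order structured rather than free text makes the later accuracy check a plain comparison instead of a fuzzy string match.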

Prompt used:

"Generate 500 diverse drive-thru interactions. Include complex orders like 'Cheeseburger no pickles', combo meals, and modifications."

500 labeled pairs in seconds. What immediately stood out was how many edge cases appeared that I hadn't planned for: mid-sentence order flips, multilingual switches, impatient customers cutting off the agent. These became the most valuable test cases in the dataset.

Step 2: Establishing a Baseline

I drafted the initial system prompt (v0.1), saved it as a versioned template, and ran it against the 500 synthetic scenarios across three models: gpt-5-nano, Gemini-3-Flash, and gpt-5-mini.

Baseline result: 80% accuracy.
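For reference, an accuracy number like this can be computed as the share of scenarios where the agent's order matches expected_order. The sketch below assumes orders are lists of dicts and uses a stub in place of the real model call; normalize, accuracy, and stub_agent are hypothetical names, not part of any platform API.

```python
from typing import Callable

Order = list[dict]  # assumed structure: [{"name": ..., "quantity": ..., "modifiers": [...]}]


def normalize(order: Order) -> list[tuple]:
    # Compare orders independent of item order, casing, and modifier order.
    return sorted(
        (
            item["name"].lower(),
            item.get("quantity", 1),
            tuple(sorted(item.get("modifiers", []))),
        )
        for item in order
    )


def accuracy(agent: Callable[[str], Order], dataset: list[dict]) -> float:
    # Fraction of rows where the agent's order equals the labeled order.
    if not dataset:
        return 0.0
    correct = sum(
        normalize(agent(row["user_transcript"])) == normalize(row["expected_order"])
        for row in dataset
    )
    return correct / len(dataset)


def stub_agent(transcript: str) -> Order:
    # Stand-in for the real model call; always returns a plain cheeseburger.
    return [{"name": "Cheeseburger"}]


dataset = [
    {"user_transcript": "One cheeseburger",
     "expected_order": [{"name": "cheeseburger", "quantity": 1}]},
    {"user_transcript": "Cheeseburger no pickles",
     "expected_order": [{"name": "cheeseburger", "modifiers": ["no pickles"]}]},
]

print(f"accuracy: {accuracy(stub_agent, dataset):.0%}")
```

The stub misses the "no pickles" modifier, so it gets one of the two rows right.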

The logic mostly held, but every response was a paragraph. Each one opened with something like:

"Certainly! I have updated your order to include a cheeseburger without pickles and a medium Sprite. Is there anything else I can help you with today?"

For a chatbot, that's fine. For a voice agent, every word adds latency. Verbosity is a failure mode, not a style choice.
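To make the cost concrete, here is a rough estimate of how long each reply takes to speak. The 150 words-per-minute rate is an assumption (a common ballpark for conversational English), and real TTS timing depends on the voice and provider; speech_seconds is a hypothetical helper.

```python
WORDS_PER_MINUTE = 150  # assumed speaking rate, not a measured value


def speech_seconds(text: str, wpm: int = WORDS_PER_MINUTE) -> float:
    # Approximate playback time from word count alone.
    return len(text.split()) / (wpm / 60)


verbose = (
    "Certainly! I have updated your order to include a cheeseburger "
    "without pickles and a medium Sprite. Is there anything else I can "
    "help you with today?"
)
terse = "Burger, no pickles. Got it."

for label, reply in (("verbose", verbose), ("terse", terse)):
    print(f"{label}: {len(reply.split())} words, ~{speech_seconds(reply):.1f}s spoken")
```

Under that assumption the 26-word confirmation takes roughly ten seconds of audio, while a five-word confirmation takes about two.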

Step 3: Stress Testing With Scenario Simulation

I connected the agent to a simulator and ran layered scenarios: hesitant users, stuttering, rushed customers, angry customers, and mid-order changes.

Three findings surfaced immediately:

1. Latency from verbosity. The agent's multi-sentence responses created dead air in every interaction.

2. Context replacement logic. When a user changed their mind, the agent added both items to the cart instead of overwriting the first. A classic stateful context bug.

3. Overall success rate: 66%. One in three conversations was failing. That's not an edge case problem. That's a production blocker.
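The context replacement bug is easiest to see as cart state. The sketch below is an illustration only: Cart, add, and replace_last are hypothetical names, and the real agent expresses this through prompting and tool calls rather than this exact class.

```python
from dataclasses import dataclass, field


@dataclass
class Cart:
    # Each entry is (category, name), e.g. ("drink", "Sprite").
    items: list[tuple[str, str]] = field(default_factory=list)

    def add(self, category: str, name: str) -> None:
        self.items.append((category, name))

    def replace_last(self, category: str, name: str) -> bool:
        # Overwrite the most recent item in this category.
        # Returns False if there is nothing to replace.
        for i in range(len(self.items) - 1, -1, -1):
            if self.items[i][0] == category:
                self.items[i] = (category, name)
                return True
        return False


cart = Cart()
cart.add("burger", "cheeseburger")
cart.add("drink", "Coke")  # hypothetical first choice

# "Actually, make that a Sprite" is a correction, so replace rather than add.
if not cart.replace_last("drink", "Sprite"):
    cart.add("drink", "Sprite")

print(cart.items)
```

The buggy behavior corresponds to calling add for the correction, which leaves both drinks in the cart.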

Step 4: Automated Prompt Optimization With ProTeGi

Manual prompt debugging is just pattern matching on logs. You run the agent, it fails, you guess what caused it, you edit, you run again. The loop is tedious and the signal is noisy.

I defined 10 evaluation criteria specific to this use case, including:

Context_Retention

Objection_Handling

Language_Switching

Because the evaluation runs against native audio rather than just transcripts, it surfaces failure patterns that text-only analysis misses entirely.

The optimizer identified two root causes:

Root Cause 1 (High Latency): "Reduce decision tree depth for menu inquiries and remove redundant validation steps."
Root Cause 2 (Hallucination): "Restrict generative capabilities to the defined menu_items vector store to prevent inventing dishes."

I selected the failed simulation runs and ran the ProTeGi algorithm with two optimization objectives:

Task_Completion

Customer_Interruption_Handling

The system iterated automatically, testing variants like "Be extremely brief" and "If user changes mind, overwrite previous item" against the simulator in a feedback loop until the metrics improved.
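For intuition, here is a heavily simplified sketch of a ProTeGi-style loop: critique the failures in natural language, rewrite the prompt using that critique, score the candidates, and keep the best few (a small beam). This is not FutureAGI's implementation. The critique, rewrite, and score functions are hypothetical stubs; in a real setup the first two are LLM calls and the third runs the simulator.

```python
from typing import Callable


def optimize_prompt(
    seed_prompt: str,
    failures: list[str],
    critique: Callable[[str, list[str]], list[str]],
    rewrite: Callable[[str, str], str],
    score: Callable[[str], float],
    beam_width: int = 2,
    rounds: int = 3,
) -> str:
    beam = [seed_prompt]
    for _ in range(rounds):
        candidates = list(beam)
        for prompt in beam:
            # Natural-language feedback on why this prompt failed.
            for feedback in critique(prompt, failures):
                candidates.append(rewrite(prompt, feedback))
        # De-duplicate while preserving order, then keep the top scorers.
        candidates = list(dict.fromkeys(candidates))
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
    return beam[0]


# --- Stubs standing in for LLM calls and simulator runs (assumptions) ---
FIXES = ["Be extremely brief.", "If user changes mind, overwrite previous item."]


def critique(prompt: str, failures: list[str]) -> list[str]:
    # Pretend the critic suggests whichever fix the prompt still lacks.
    return [fix for fix in FIXES if fix not in prompt]


def rewrite(prompt: str, feedback: str) -> str:
    return f"{prompt} {feedback}"


def score(prompt: str) -> float:
    # Pretend the simulator score rises with each fix present.
    return float(sum(fix in prompt for fix in FIXES))


best = optimize_prompt(
    "You are a drive-thru order taker.",
    failures=["verbose reply", "appended drink instead of replacing"],
    critique=critique,
    rewrite=rewrite,
    score=score,
)
print(best)
```

The design point is that scoring, not the human, decides which edits survive, which is why the evaluation criteria matter more than any single prompt tweak.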

I've spent hours doing this by hand on other projects. Watching it run automatically was a different experience entirely.

Results

Metric	Before	After
Success Rate	66%	96%
Response Style	Multi-sentence paragraphs	Single crisp confirmations
Context Handling	Appended on change	Overwrites correctly
Latency Pattern	Verbose, slow	"Burger, no pickles. Got it."

Going from 66% to 96% without writing a single new instruction manually confirmed the loop works: Dataset > Simulate > Evaluate > Optimize.
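It helps to read that jump as a drop in failures rather than a gain in successes. A quick check of the arithmetic, using only the two success rates reported above:

```python
before_pct, after_pct = 66, 96

gain_points = after_pct - before_pct      # absolute gain in percentage points
failures_before = 100 - before_pct        # failed conversations per 100
failures_after = 100 - after_pct
reduction_factor = failures_before / failures_after

print(f"+{gain_points} points; failures per 100 calls: "
      f"{failures_before} -> {failures_after} ({reduction_factor:.1f}x fewer)")
```

Per 100 conversations, failures fall from 34 to 4.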

The Loop, Not the Tool

The cold start problem for voice agents is real. No users means no data. No data means no baseline. Synthetic simulation breaks that dependency.

The more important shift is recognizing that prompt debugging is automatable. The hard work is upfront: defining the right evaluation criteria for your specific agent. Once those are set, iteration becomes a system rather than a guessing game.

The full architecture walkthrough, including the simulation setup, is documented here.

What evaluation criteria do you track in production voice agents? Context retention and interruption handling were the obvious ones for this use case. I'm curious what others measure that isn't in the standard rubrics.
