My Voice Agent Went From a 66% to a 96% Success Rate. Here's the Exact Optimization Loop I Used.
How synthetic data generation, scenario simulation, and automated prompt optimization fixed a production voice agent without touching a single real user call.

Most voice agent testing loops look like this: you dial in, order something, decide it felt okay, and ship. I did this for longer than I should have on a drive-thru voice agent I was building called "Future Burger."
The problem with the vibe-check loop isn't just that it's unscientific. It's that it only tests the scenarios you can think of. The edge cases that break your agent in production are the ones you didn't write a test for.
Here's the loop I built to replace it, and the exact results it produced.
The Architecture Decision That Mattered Most
Before touching any optimization, I made one foundational call: the STT and TTS layers are peripherals. They're the ears and mouth, interchangeable.
The LLM is the brain. It handles reasoning, context tracking, mid-conversation state changes, and tool calling. If it can't figure out that "Actually, make that a Sprite" means replacing the drink rather than adding it, no amount of voice synthesis quality saves the interaction.
Every optimization effort went into the intelligence layer.
Step 1: Dataset
I needed a labeled dataset to evaluate against, but I had no real call logs. Rather than wait weeks for production traffic, I used FutureAGI's Dataset to build ground truth from scratch.
The schema had two fields: user_transcript (what the user says) and expected_order (the correct agent output).
Prompt used:
"Generate 500 diverse drive-thru interactions. Include complex orders like 'Cheeseburger no pickles', combo meals, and modifications."
500 labeled pairs in seconds. What immediately stood out was how many edge cases appeared that I hadn't planned for: mid-sentence order flips, multilingual switches, impatient customers cutting off the agent. These became the most valuable test cases in the dataset.
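For illustration, here is a minimal sketch of what one row of that two-field schema could look like in Python. Only the field names user_transcript and expected_order come from the dataset described above. The shape of expected_order (a list of item dicts with modifiers) and the JSONL helpers are assumptions, not FutureAGI's export format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Example:
    user_transcript: str  # what the user says
    expected_order: list  # correct agent output; the list-of-dicts shape is an assumption


def to_jsonl(examples):
    """Serialize examples to one JSON object per line."""
    return "\n".join(json.dumps(asdict(e)) for e in examples)


def from_jsonl(text):
    """Parse JSONL text back into Example objects, skipping blank lines."""
    return [Example(**json.loads(line)) for line in text.splitlines() if line.strip()]


# Two hand-written rows in the same spirit as the generated data (hypothetical).
examples = [
    Example("Cheeseburger no pickles",
            [{"item": "cheeseburger", "modifiers": ["no pickles"]}]),
    Example("A Coke. Actually, make that a Sprite",
            [{"item": "sprite", "modifiers": []}]),
]

roundtrip = from_jsonl(to_jsonl(examples))
```

Keeping expected_order structured rather than free text makes it possible to compare agent output by exact match later.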
Step 2: Establishing a Baseline
I drafted the initial system prompt (v0.1), saved it as a versioned template, and ran it against the 500 synthetic scenarios across three models: gpt-5-nano, Gemini-3-Flash, and gpt-5-mini.
Baseline result: 80% accuracy.
The logic mostly held, but every response was a paragraph. Each one opened with something like:
"Certainly! I have updated your order to include a cheeseburger without pickles and a medium Sprite. Is there anything else I can help you with today?"
For a chatbot, that's fine. For a voice agent, every word adds latency. Verbosity is a failure mode, not a style choice.
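A rough way to put a number on this is to count words and convert to speaking time. The sketch below assumes a speaking rate of 150 words per minute, which is a ballpark figure and not a measurement of any particular TTS engine. The terse string is the confirmation style the agent ended up with after optimization.

```python
WORDS_PER_MINUTE = 150  # assumed speaking rate, not a measured TTS figure


def word_count(text: str) -> int:
    """Count whitespace-separated tokens."""
    return len(text.split())


def estimated_speech_seconds(text: str, wpm: int = WORDS_PER_MINUTE) -> float:
    """Estimate how long the text takes to speak at the assumed rate."""
    return word_count(text) * 60.0 / wpm


verbose = ("Certainly! I have updated your order to include a cheeseburger "
           "without pickles and a medium Sprite. Is there anything else I can "
           "help you with today?")
terse = "Burger, no pickles. Got it."

for label, reply in (("verbose", verbose), ("terse", terse)):
    print(label, word_count(reply), "words,",
          round(estimated_speech_seconds(reply), 1), "seconds")
```

Under that assumption the verbose reply is 26 words and the terse one is 5, so the gap in talk time is roughly fivefold before any model or network latency is added.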
Step 3: Stress Testing With Scenario Simulation
I connected the agent to a simulator and ran layered scenarios: hesitant users, stuttering, rushed customers, angry customers, and mid-order changes.
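One way to think about "layered" scenarios is as a cross product of independent dimensions. The sketch below is a hypothetical grouping of the scenario types listed above into persona, speech pattern, and event. It is not the simulator's actual configuration format.

```python
from itertools import product

# Hypothetical grouping of the scenario types into three independent layers.
PERSONAS = ["hesitant", "rushed", "angry"]
SPEECH = ["clean", "stuttering"]
EVENTS = ["none", "mid_order_change"]


def build_scenarios():
    """Combine every persona with every speech pattern and event."""
    return [
        {"persona": p, "speech": s, "event": e}
        for p, s, e in product(PERSONAS, SPEECH, EVENTS)
    ]


scenarios = build_scenarios()
print(len(scenarios), "layered scenarios")
```

Crossing the layers produces combinations such as an angry, stuttering customer who changes the order midway, which is the kind of case a manual test call rarely covers.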
Three failure patterns surfaced immediately:
1. Latency from verbosity. The agent's multi-sentence responses created dead air in every interaction.
2. Context replacement logic. When a user changed their mind, the agent added both items to the cart instead of overwriting the first. A classic stateful context bug.
3. Overall success rate: 66%. One in three conversations was failing. That's not an edge case problem. That's a production blocker.
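The second failure pattern is easiest to see in cart state. This is a minimal sketch, assuming a correction always targets the most recently added item. The Cart class and its method names are hypothetical and are not the agent's real tool interface.

```python
from dataclasses import dataclass, field


@dataclass
class Cart:
    items: list = field(default_factory=list)

    def add(self, item: str) -> None:
        """Append a new item to the order."""
        self.items.append(item)

    def replace_last(self, item: str) -> None:
        """Overwrite the most recent item, or add it if the cart is empty."""
        if self.items:
            self.items[-1] = item
        else:
            self.items.append(item)


# Buggy behavior: "Actually, make that a Sprite" is treated as a new item.
buggy = Cart()
buggy.add("cheeseburger")
buggy.add("coke")
buggy.add("sprite")

# Intended behavior: the correction overwrites the previous drink.
fixed = Cart()
fixed.add("cheeseburger")
fixed.add("coke")
fixed.replace_last("sprite")

print(buggy.items)
print(fixed.items)
```

The fix is less about the data structure than about the model choosing the replace action when it hears a correction, which is why the prompt is what needed to change.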
Step 4: Automated Prompt Optimization With ProTeGi
Manual prompt debugging is just pattern matching on logs. You run the agent, it fails, you guess what caused it, you edit, you run again. The loop is tedious and the signal is noisy.
I defined 10 evaluation criteria specific to this use case, including:
Context_Retention
Objection_Handling
Language_Switching
Because the evaluation runs against native audio rather than just transcripts, it surfaces failure patterns that text-only analysis misses entirely.
The optimizer identified two root causes:
Root Cause 1 (High Latency): "Reduce decision tree depth for menu inquiries and remove redundant validation steps."
Root Cause 2 (Hallucination): "Restrict generative capabilities to the defined menu_items vector store to prevent inventing dishes."
I selected the failed simulation runs and ran the ProTeGi algorithm with two optimization objectives:
Task_Completion
Customer_Interruption_Handling
The system iterated automatically, testing variants like "Be extremely brief" and "If user changes mind, overwrite previous item" against the simulator in a feedback loop until the metrics improved.
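To make the shape of that feedback loop concrete, here is a deliberately simplified sketch. It is not ProTeGi, which as I understand it has an LLM write critiques of failures and searches over prompt edits with beam search. This is a greedy stand-in with a stubbed scorer in place of the simulator, and every name and test case in it is hypothetical.

```python
# Toy failed runs: (scenario tag, instruction the prompt needs to pass it).
FAILED_RUNS = [
    ("verbose_reply", "Be extremely brief."),
    ("verbose_reply", "Be extremely brief."),
    ("mid_order_change", "If user changes mind, overwrite previous item."),
]

CANDIDATE_EDITS = [
    "Be extremely brief.",
    "If user changes mind, overwrite previous item.",
    "Always greet the customer warmly.",
]


def score(prompt: str) -> float:
    """Stub for the simulator: fraction of failed runs whose fix is in the prompt."""
    passed = sum(1 for _, needed in FAILED_RUNS if needed in prompt)
    return passed / len(FAILED_RUNS)


def optimize(base_prompt: str, edits, max_rounds: int = 5):
    """Greedily append the edit that improves the score most, until nothing helps."""
    best, best_score = base_prompt, score(base_prompt)
    for _ in range(max_rounds):
        variants = [best + " " + e for e in edits if e not in best]
        if not variants:
            break
        top = max(variants, key=score)
        if score(top) <= best_score:
            break  # no remaining edit improves the metric
        best, best_score = top, score(top)
    return best, best_score


prompt, final_score = optimize("You take drive-thru orders.", CANDIDATE_EDITS)
print(final_score)
print(prompt)
```

In this toy run the loop keeps the two instructions that fix failing cases and never adds the greeting instruction, because it does not move the score. A real optimizer replaces the stub with simulator runs and generates its own candidate edits.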
I've spent hours doing this by hand on other projects. Watching it run automatically was a different experience entirely.
Results
| Metric | Before | After |
|---|---|---|
| Success Rate | 66% | 96% |
| Response Style | Multi-sentence paragraphs | Single crisp confirmations |
| Context Handling | Appended on change | Overwrites correctly |
| Latency Pattern | Verbose, slow | "Burger, no pickles. Got it." |
Going from 66% to 96% without writing a single new instruction manually confirmed the loop works: Dataset > Simulate > Evaluate > Optimize.
The Loop, Not the Tool
The cold start problem for voice agents is real. No users means no data. No data means no baseline. Synthetic simulation breaks that dependency.
The more important shift is recognizing that prompt debugging is automatable. The hard work is upfront: defining the right evaluation criteria for your specific agent. Once those are set, iteration becomes a system rather than a guessing game.
The full architecture walkthrough, including the simulation setup, is documented here.
What evaluation criteria do you track in production voice agents? Context retention and interruption handling were the obvious ones for this use case. I'm curious what others measure that isn't in the standard rubrics.




