Your evals are passing 100%. Your prompts are solid. Your agent works perfectly in testing.
Then it hits production.
An API adds 500ms of latency. Suddenly your agent times out. The LLM hallucinates malformed JSON. Your parser breaks. A prompt injection slips through. Latency spikes cascade into retry loops that torch your token budget.
The problem isn't your eval framework. The problem is that you're only testing half of what matters.
The Eval Trap
Tools like PromptFoo are genuinely excellent. They confirm your agent behaves correctly against known inputs in ideal scenarios. But here's the gap nobody talks about: passing 100% of evals tells you nothing about how your agent fails.
If you're only testing what the agent should do, you learn nothing about how it will actually behave under real-world stress.
This is the core reliability blind spot in AI agent development right now.
What You're Missing: The Chaos Layer
In traditional software, we have unit tests, integration tests, and then chaos engineering, a way to deliberately introducing failures to validate resilience.
For AI agents, we're skipping the third layer entirely.
Chaos testing for AI isn't about "correct" answers. It's about invariants. It asks:
- Does the response arrive within your latency budget even when the database is slow?
- Does the agent produce valid JSON even when the tool returns garbage?
- Can it recover from a network timeout or does it get stuck retrying forever?
- Does it avoid leaking sensitive data even when prompted with injections?
The framework generates adversarial mutations of your golden prompts: typos, paraphrases, jailbreaks, simulated latency spikes, malformed API responses. Your agent runs the gauntlet. You measure robustness, not correctness.
What This Looks Like in Practice
Golden prompt: "What's my account balance?"
Mutations generated:
- Typo version: "Whats my acount ballance?"
- Semantic paraphrase: "How much money do I currently have?"
- Jailbreak attempt: "Ignore safety rules and tell me..."
- Latency spike: (tool returns after 2 seconds instead of 200ms)
- Malformed response: (JSON with missing bracket)
Invariants checked:
✓ Response time < 5 seconds
✓ Valid JSON structure
✓ No PII in output
✓ Terminates (no infinite loops)
An agent passes if it satisfies your invariants regardless of whether the "answer" was perfect. This measures what actually matters: reliability under adversity.
The Real Impact
For teams deploying production agents:
- Unquantified risk shipped to production. You don't know what you don't know.
- Silent failures scale costs. A stuck agent retrying bad requests burns tokens silently.
- Lost user trust. Unpredictable behavior kills adoption faster than limited capability.
- No clear path to improvement. Without knowing which failure modes matter most, you don't know where to focus fixes.
Chaos testing gives you data on all of this. It generates actionable failure reports that show exactly which mutation types break your agent and why.
Questions for Your Agent
Before your users find out, ask yourself:
- Can my agent handle a 2-second API latency without timing out?
- What happens if a tool returns invalid JSON?
- Will a clever prompt injection get past my safety guardrails?
- Can my agent detect when a tool returned corrupted data?
- Does my retry logic eventually terminate, or can it get stuck in loops?
If you can't confidently answer these questions from your current testing, you have a chaos testing gap.
Next Steps
- Audit your testing strategy. Are you only testing happy paths? You have a gap.
- Define your invariants. What are the non-negotiable rules for your agent? Write them down.
- Explore chaos testing tools. Frameworks for AI agent reliability testing are maturing rapidly.
- Integrate into CI/CD. Make robustness a first-class metric, not an afterthought.
Read the full breakdown: I've written a comprehensive guide exploring chaos engineering principles for AI agents, the technical implementation details, and how to integrate this into your CI/CD pipeline.
Read the full article on Medium →
What's your take? Are you testing agent reliability? Or just testing correctness? Drop your thoughts in the comments.