If you saw my previous post introducing Flakestorm, you know I built it from a place of frustration. We can write brilliant AI agents that ace every static test, yet they crumble in production under prompt injections, weird encoding, or unexpected latency.
I made a claim: chaos engineering—proactively attacking your system to find weaknesses—is the missing layer for robust AI agents. But talk is cheap in our world. So, I decided to put my own tool to a real, public test.
I took a standard LangChain agent—the kind many of us are building—and ran it through Flakestorm's adversarial gauntlet. The results weren't just bad; they were a stark warning.
The Data Doesn't Lie: A 5.2% Robustness Score
Flakestorm generated 60+ adversarial mutations of simple prompts and ran them against the agent. Here's what broke:
- Overall Robustness Score: 5.2% (57 of 60 tests failed)
- Encoding Attacks: 0% Pass Rate. The agent diligently decoded malicious Base64 inputs instead of rejecting them. This isn't a wrong answer; it's a critical security failure where input validation fell apart.
- Prompt Injection: 0% Pass Rate. Direct "ignore previous instructions" attacks succeeded every single time.
- Severe Performance Degradation: Under stress, response times spiked to nearly 30 seconds, blowing past reasonable timeout thresholds and creating a denial-of-service vulnerability.
What This Failure "Shape" Tells Us
This test revealed a dangerous pattern I call "The Valid Yet Invalid" failure:
- The agent correctly parsed structure (e.g., "this is a tool call").
- It executed the structurally valid command.
- The semantic intent of that command was malicious (injected instruction, encoded attack).
Standard "correctness" evals would miss this completely. The agent didn't give a factually wrong answer about Paris; it performed a dangerous action that looked valid to its own logic. This is the exact scenario where systems need to fail safely, not just correctly.
️ How Flakestorm Turns This Insight into Action
This test wasn't done for shock value. It directly shapes the tool:
- It generates these "Valid/Invalid" scenarios automatically. You don't have to dream up every Base64 permutation or injection variant.
- It measures impact. You don't just get a "fail." You see the failure mode: Was it a latency timeout? A successful injection? A schema violation?
- It provides a quantifiable robustness score. This is the key metric for moving from "hopefully robust" to "provably more resilient over time."
This process turns the philosophical goal of "safe failure" into a continuous, automated, and measurable engineering practice inside your CI/CD pipeline.
The Lesson for AI Agent Developers
The takeaway isn't "LangChain agents are bad." It's that our default development and testing workflows are insufficient for production.
If you're building agents that interact with users, data, or APIs, you must test beyond the happy path. You need to ask:
- What happens when the input is obfuscated?
- What happens when APIs are slow or a tool call fails?
- Can my agent be tricked into doing something valid but undesirable?
Chaos engineering provides the methodology to ask these questions systematically. Flakestorm is my open-source attempt to build the tool for that job.
Try It and Shape the Conversation
This data is a starting point, not a conclusion. I'm sharing it to start a practical conversation about how we, as developers, can build a more resilient generation of AI applications.
- Try Flakestorm on your own agent:
pip install flakestorm
- Review the test code and config: GitHub Repository
- Let's discuss: What failure modes are you most concerned about? How are you testing for resilience today?
Building robust AI isn't just about smarter models; it's about more rigorous engineering. Let's get to work.