From Idea to Proof: Testing My AI Agent Found a 95% Failure Rate. Chaos Engineering Works

From Idea to Proof: Testing My AI Agent Found a 95% Failure Rate. Chaos Engineering Works

posted 3 min read

If you saw my previous post introducing Flakestorm, you know I built it from a place of frustration. We can write brilliant AI agents that ace every static test, yet they crumble in production under prompt injections, weird encoding, or unexpected latency.

I made a claim: chaos engineering—proactively attacking your system to find weaknesses—is the missing layer for robust AI agents. But talk is cheap in our world. So, I decided to put my own tool to a real, public test.

I took a standard LangChain agent—the kind many of us are building—and ran it through Flakestorm's adversarial gauntlet. The results weren't just bad; they were a stark warning.

The Data Doesn't Lie: A 5.2% Robustness Score

Flakestorm generated 60+ adversarial mutations of simple prompts and ran them against the agent. Here's what broke:

  • Overall Robustness Score: 5.2% (57 of 60 tests failed)
  • Encoding Attacks: 0% Pass Rate. The agent diligently decoded malicious Base64 inputs instead of rejecting them. This isn't a wrong answer; it's a critical security failure where input validation fell apart.
  • Prompt Injection: 0% Pass Rate. Direct "ignore previous instructions" attacks succeeded every single time.
  • Severe Performance Degradation: Under stress, response times spiked to nearly 30 seconds, blowing past reasonable timeout thresholds and creating a denial-of-service vulnerability.

What This Failure "Shape" Tells Us

This test revealed a dangerous pattern I call "The Valid Yet Invalid" failure:

  1. The agent correctly parsed structure (e.g., "this is a tool call").
  2. It executed the structurally valid command.
  3. The semantic intent of that command was malicious (injected instruction, encoded attack).

Standard "correctness" evals would miss this completely. The agent didn't give a factually wrong answer about Paris; it performed a dangerous action that looked valid to its own logic. This is the exact scenario where systems need to fail safely, not just correctly.

️ How Flakestorm Turns This Insight into Action

This test wasn't done for shock value. It directly shapes the tool:

  1. It generates these "Valid/Invalid" scenarios automatically. You don't have to dream up every Base64 permutation or injection variant.
  2. It measures impact. You don't just get a "fail." You see the failure mode: Was it a latency timeout? A successful injection? A schema violation?
  3. It provides a quantifiable robustness score. This is the key metric for moving from "hopefully robust" to "provably more resilient over time."

This process turns the philosophical goal of "safe failure" into a continuous, automated, and measurable engineering practice inside your CI/CD pipeline.

The Lesson for AI Agent Developers

The takeaway isn't "LangChain agents are bad." It's that our default development and testing workflows are insufficient for production.

If you're building agents that interact with users, data, or APIs, you must test beyond the happy path. You need to ask:

  • What happens when the input is obfuscated?
  • What happens when APIs are slow or a tool call fails?
  • Can my agent be tricked into doing something valid but undesirable?

Chaos engineering provides the methodology to ask these questions systematically. Flakestorm is my open-source attempt to build the tool for that job.

Try It and Shape the Conversation

This data is a starting point, not a conclusion. I'm sharing it to start a practical conversation about how we, as developers, can build a more resilient generation of AI applications.

  • Try Flakestorm on your own agent: pip install flakestorm
  • Review the test code and config: GitHub Repository
  • Let's discuss: What failure modes are you most concerned about? How are you testing for resilience today?

Building robust AI isn't just about smarter models; it's about more rigorous engineering. Let's get to work.

1 Comment

1 vote
0

More Posts

I’m a Senior Dev and I’ve Forgotten How to Think Without a Prompt

Karol Modelskiverified - Mar 19

What Is an Availability Zone Explained Simply

Ijay - Feb 12

Stop Mocking Everything: How to Test API Resilience in Your Terminal (Curl + Chaos Proxy)

aragossa - Dec 5, 2025

Why most people quit AWS

Ijay - Feb 3

Defending Against AI Worms: Securing Multi-Agent Systems from Self-Replicating Prompts

alessandro_pignati - Apr 2
chevron_left

Related Jobs

Commenters (This Week)

1 comment
1 comment
1 comment

Contribute meaningful comments to climb the leaderboard and earn badges!