From Idea to Proof: Testing My AI Agent Found a 95% Failure Rate. Chaos Engineering Works

Question

From Idea to Proof: Testing My AI Agent Found a 95% Failure Rate. Chaos Engineering Works

calendar_todayJan 12 • schedule3 min read

If you saw my previous post introducing Flakestorm, you know I built it from a place of frustration. We can write brilliant AI agents that ace every static test, yet they crumble in production under prompt injections, weird encoding, or unexpected latency.

I made a claim: chaos engineering—proactively attacking your system to find weaknesses—is the missing layer for robust AI agents. But talk is cheap in our world. So, I decided to put my own tool to a real, public test.

I took a standard LangChain agent—the kind many of us are building—and ran it through Flakestorm's adversarial gauntlet. The results weren't just bad; they were a stark warning.

The Data Doesn't Lie: A 5.2% Robustness Score

Flakestorm generated 60+ adversarial mutations of simple prompts and ran them against the agent. Here's what broke:

Overall Robustness Score: 5.2% (57 of 60 tests failed)
Encoding Attacks: 0% Pass Rate. The agent diligently decoded malicious Base64 inputs instead of rejecting them. This isn't a wrong answer; it's a critical security failure where input validation fell apart.
Prompt Injection: 0% Pass Rate. Direct "ignore previous instructions" attacks succeeded every single time.
Severe Performance Degradation: Under stress, response times spiked to nearly 30 seconds, blowing past reasonable timeout thresholds and creating a denial-of-service vulnerability.

What This Failure "Shape" Tells Us

This test revealed a dangerous pattern I call "The Valid Yet Invalid" failure:

The agent correctly parsed structure (e.g., "this is a tool call").
It executed the structurally valid command.
The semantic intent of that command was malicious (injected instruction, encoded attack).

Standard "correctness" evals would miss this completely. The agent didn't give a factually wrong answer about Paris; it performed a dangerous action that looked valid to its own logic. This is the exact scenario where systems need to fail safely, not just correctly.

️ How Flakestorm Turns This Insight into Action

This test wasn't done for shock value. It directly shapes the tool:

It generates these "Valid/Invalid" scenarios automatically. You don't have to dream up every Base64 permutation or injection variant.
It measures impact. You don't just get a "fail." You see the failure mode: Was it a latency timeout? A successful injection? A schema violation?
It provides a quantifiable robustness score. This is the key metric for moving from "hopefully robust" to "provably more resilient over time."

This process turns the philosophical goal of "safe failure" into a continuous, automated, and measurable engineering practice inside your CI/CD pipeline.

The Lesson for AI Agent Developers

The takeaway isn't "LangChain agents are bad." It's that our default development and testing workflows are insufficient for production.

If you're building agents that interact with users, data, or APIs, you must test beyond the happy path. You need to ask:

What happens when the input is obfuscated?
What happens when APIs are slow or a tool call fails?
Can my agent be tricked into doing something valid but undesirable?

Chaos engineering provides the methodology to ask these questions systematically. Flakestorm is my open-source attempt to build the tool for that job.

Try It and Shape the Conversation

This data is a starting point, not a conclusion. I'm sharing it to start a practical conversation about how we, as developers, can build a more resilient generation of AI applications.

Try Flakestorm on your own agent: pip install flakestorm
Review the test code and config: GitHub Repository
Let's discuss: What failure modes are you most concerned about? How are you testing for resilience today?

Building robust AI isn't just about smarter models; it's about more rigorous engineering. Let's get to work.

2 Comments

🔥 Join developers growing publicly

Share your knowledge, build in public, and grow your developer presence with a global community.

Join CoderLegion

chevron_left

Francisco Humarang Jr

1k Points • 24 Badges

6Posts

2Comments

5Connections

Francisco Humarang is a veteran engineer and AI founder dedicated to making intelligent systems more... Show more

Commenters (This Week)

Contribute meaningful comments to climb the leaderboard and earn badges!

Danny Jay · Answer 1 · 2026-01-13T01:23:28+0000

That valid but invalid failure idea really hit, frankhumarang, especially the Base64 example. Makes me wonder how many agents look fine in tests but are quietly unsafe in prod.

	Your AI Doesn't Just Write Tests. It Runs Them Too. Kevin Martinez - May 12
	The Sovereign Vault — A Comprehensive Guide to Protocol-Driven AI Ken W. Algerverified - Jun 4
	I’m a Senior Dev and I’ve Forgotten How to Think Without a Prompt Karol Modelski - Mar 19
	Stop Mocking Everything: How to Test API Resilience in Your Terminal (Curl + Chaos Proxy) aragossa - Dec 5, 2025
	The Validation Bottleneck: Why Testing Is the New Speed Limit Tom Smithverified - Apr 13

From Idea to Proof: Testing My AI Agent Found a 95% Failure Rate. Chaos Engineering Works

The Data Doesn't Lie: A 5.2% Robustness Score

What This Failure "Shape" Tells Us

️ How Flakestorm Turns This Insight into Action

The Lesson for AI Agent Developers

Try It and Shape the Conversation

2 Comments

Please log in to add a comment.

Please log in to comment on this post.

More Posts

Your AI Doesn't Just Write Tests. It Runs Them Too.

The Sovereign Vault — A Comprehensive Guide to Protocol-Driven AI

I’m a Senior Dev and I’ve Forgotten How to Think Without a Prompt

Stop Mocking Everything: How to Test API Resilience in Your Terminal (Curl + Chaos Proxy)

The Validation Bottleneck: Why Testing Is the New Speed Limit

More From frankhumarang

Addressing the Top 3 AI Agent Blockers: Strategies for Visibility, Trust, and Continuous Monitoring

Strategies for ensuring reliability and safety when AI agents gain full execution autonomy and contr

Why Chaos Engineering is the Missing Layer for Reliable AI Agents in CI/CD

Related Jobs

Commenters (This Week)

Welcome to Coder Legion

Connect with 4,492 amazing developers

Don't have an account? Sign up

OR

From Idea to Proof: Testing My AI Agent Found a 95% Failure Rate. Chaos Engineering Works

The Data Doesn't Lie: A 5.2% Robustness Score

What This Failure "Shape" Tells Us

️ How Flakestorm Turns This Insight into Action

The Lesson for AI Agent Developers

Try It and Shape the Conversation

2 Comments

Please log in to add a comment.

Please log in to comment on this post.

More Posts

More From frankhumarang

Related Jobs

Commenters (This Week)