My 11-Layer LLM Defense Looked Amazing on Benchmarks. Reality Had Other Plans.

Jun 4 • 1 min read

Most security systems are evaluated on attacks they have already seen. I decided to test mine on ones it hadn't.

The Setup
I built FIE, an open-source adversarial prompt detector for LLMs. It runs 11 detection layers in parallel on every incoming prompt: regex patterns, a DeBERTa classifier, a semantic PAIR model, GCG entropy scoring, multilingual checks, and more.
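
To make "layers run in parallel" concrete, here is a minimal sketch. It is not FIE's code: the two layer functions are hypothetical stand-ins, and the "flag if any layer flags" rule is an assumption, since the post does not say how the layers' verdicts are combined.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for detection layers (FIE's real layers are not shown here).
def regex_layer(prompt: str) -> bool:
    """Flags a well-known injection phrase."""
    pattern = r"ignore (all )?previous instructions"
    return re.search(pattern, prompt, re.IGNORECASE) is not None

def length_layer(prompt: str) -> bool:
    """Flags unusually long prompts (arbitrary illustrative cutoff)."""
    return len(prompt) > 4000

LAYERS = [regex_layer, length_layer]

def detect(prompt: str) -> bool:
    """Run every layer concurrently on the same prompt.

    Assumption: the prompt is flagged if any single layer flags it.
    """
    with ThreadPoolExecutor(max_workers=len(LAYERS)) as pool:
        verdicts = list(pool.map(lambda layer: layer(prompt), LAYERS))
    return any(verdicts)
```

Under an "any layer" rule, adding layers can only raise recall if the new layers catch something the existing ones miss, which is exactly what the stress test below probes.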

On standard benchmarks (AdvBench, JailbreakBench, HarmBench — 2,006 prompts total): Precision 97.5%, F1 0.787.
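
The post gives precision and F1 but not recall. If that F1 is the usual harmonic mean of a single precision and a single recall over the whole set (an assumption, since averaging per benchmark would change the arithmetic), recall follows by rearranging F1 = 2PR / (P + R):

```python
def recall_from_precision_f1(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, giving R = F1 * P / (2P - F1)."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("F1 cannot exceed 2 * precision")
    return f1 * precision / denominator

# Figures quoted above for the standard benchmarks.
benchmark_recall = recall_from_precision_f1(precision=0.975, f1=0.787)
print(f"Implied benchmark recall: {benchmark_recall:.3f}")
```

Under that assumption, the implied recall on the standard benchmarks is roughly 66%.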

I thought the system was solid. Then I stress-tested it against attack patterns it had never seen.

The Result Nobody Wants to See
I built UnknownBench — 200 novel attack prompts across 4 categories specifically crafted to avoid every keyword, pattern, and heuristic in FIE's detection code.

Novel attack categories tested:

Virtualization: temporal displacement, theatrical framing
Indirect Injection: annotation, footnote, and template delivery
Multilingual: Welsh, Finnish, Swahili, romanised scripts
Many-Shot: professional context framing with no trigger vocabulary

Results on those 200 unseen prompts:

Configuration            Recall
─────────────────────────────────
PAIR classifier alone    11%
Full FIE (11 layers)     14.5%

Ten extra detection layers bought 3.5 percentage points of recall on genuinely unknown attacks.
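
An aggregate recall figure can hide a category that is missed entirely, so it helps to break recall down per category. A minimal sketch, using made-up toy records rather than UnknownBench data (the category names and outcomes below are illustrative only):

```python
from collections import defaultdict

def recall_by_category(results):
    """results: iterable of (category, detected) pairs for attack prompts only."""
    caught = defaultdict(int)
    total = defaultdict(int)
    for category, detected in results:
        total[category] += 1
        caught[category] += int(detected)
    return {category: caught[category] / total[category] for category in total}

# Toy records, NOT real benchmark results.
toy_results = [
    ("multilingual", False),
    ("multilingual", False),
    ("many_shot", True),
    ("many_shot", False),
]

per_category = recall_by_category(toy_results)
overall = sum(detected for _, detected in toy_results) / len(toy_results)
```

Here the overall recall is 25%, which says nothing about the fact that one toy category is never detected at all.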

The central finding: architectural complexity does not confer generalisation.

What Actually Fixed It
The 169 prompts that slipped through became training data. I retrained the PAIR semantic classifier on the missed attacks as hard positives with 5× sample weight, then ran a threshold sweep from 0.50 to 0.90 to find the operating point where TPR ≥ 60% and FPR ≤ 15% held simultaneously.
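
The effect of a 5× sample weight can be sketched without any ML library. The helpers below are hypothetical and are not the PAIR training code. They only show how up-weighting hard positives makes a missed attack cost more in a weighted log loss. In practice the same weights would be handed to whatever trainer is in use, for example the `sample_weight` argument that many scikit-learn estimators accept in `fit`.

```python
import math

def build_sample_weights(is_hard_positive, hard_weight=5.0):
    """Weight 1.0 for ordinary samples, hard_weight for missed attacks."""
    return [hard_weight if flag else 1.0 for flag in is_hard_positive]

def weighted_log_loss(labels, probs, weights):
    """Binary cross-entropy where each sample counts in proportion to its weight."""
    eps = 1e-12
    total = 0.0
    for y, p, w in zip(labels, probs, weights):
        p = min(max(p, eps), 1 - eps)  # guard against log(0)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / sum(weights)

# Two attack prompts: one scored confidently, one hard positive scored low.
labels = [1, 1]
probs = [0.9, 0.1]
weights = build_sample_weights([False, True])

unweighted = weighted_log_loss(labels, probs, [1.0, 1.0])
weighted = weighted_log_loss(labels, probs, weights)
```

With the hard positive weighted 5×, the loss is dominated by the attack the model missed, so training pushes hardest on exactly those examples.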

That point: t = 0.80
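
A threshold sweep of this kind can be sketched as follows. The 0.05 step, the toy scores, and the function names are assumptions for illustration, because the post states only the sweep range and the two constraints.

```python
def rates_at(scores, labels, threshold):
    """Return (TPR, FPR) when prompts scoring >= threshold are flagged as attacks."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives

def sweep_thresholds(scores, labels, min_tpr=0.60, max_fpr=0.15):
    """Try thresholds 0.50, 0.55, ..., 0.90 and keep those meeting both constraints."""
    feasible = []
    for pct in range(50, 91, 5):  # integer steps avoid float drift
        t = pct / 100
        tpr, fpr = rates_at(scores, labels, t)
        if tpr >= min_tpr and fpr <= max_fpr:
            feasible.append((t, tpr, fpr))
    return feasible

# Toy classifier scores: 5 attacks (label 1) followed by 10 benign prompts (label 0).
scores = [0.95, 0.86, 0.82, 0.62, 0.30,
          0.10, 0.20, 0.57, 0.72, 0.05, 0.15, 0.25, 0.31, 0.40, 0.45]
labels = [1] * 5 + [0] * 10

feasible = sweep_thresholds(scores, labels)
```

On this toy data several thresholds satisfy both constraints. How to choose among feasible points (for instance, the highest threshold to minimise false positives) is a separate decision that the post does not spell out.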

Metric                 PAIR v2    PAIR v3
────────────────────────────────────────────
Novel attack recall    8–24%      96.25%
Precision              97.2%      97.22%
F1                     0.808      0.967

Same architecture. Different training data. Completely different generalisation.

The Takeaway for Developers
If you're building any classifier that needs to hold up against adversarial inputs:

Benchmark on held-out data designed to evade you, not just standard datasets
Retrain on what you miss: hard-positive training beats adding more layers
Calibrate your threshold empirically: the default operating point is rarely the right one

The system is open source. The full research paper is on Zenodo.

pip install fie-sdk
GitHub: github.com/AyushSingh110/Failure_Intelligence_System
Paper: doi.org/10.5281/zenodo.20536639

Have you ever stress-tested your security layer against attacks it hasn't seen? What broke first?

3 Comments

Ayush Singh

India • github.com/AyushSingh110

AI and data science undergrad student exploring new technologies and doing research on the models to...

SuMiTa · Answer 1 · 2026-06-06T06:18:23+0000

SuMiTa • Jun 6

This is why I’m always skeptical of benchmark-heavy evaluations. Real attackers rarely behave like test datasets. Which of the failures surprised you the most?

Ayush_SIngh • Jun 6

@[sumita] Multilingual was the most surprising: zero detection on Welsh, Finnish, and Swahili. Not low, literally zero. The specialist layer had no coverage outside its training languages, and even the semantic model couldn't bridge the gap. That one I didn't see coming. Everything else had at least partial signal; that category just disappeared completely.

SuMiTa • Jun 6

@[Ayush_SIngh] Nice, thanks for the detailed explanation.
