My 11-Layer LLM Defense Looked Amazing on Benchmarks. Reality Had Other Plans.


Most security systems are evaluated on attacks they have already seen. I decided to test mine on ones it hadn't.

The Setup
I built FIE, an open-source adversarial prompt detector for LLMs. Eleven detection layers run in parallel on every incoming prompt: regex patterns, a DeBERTa classifier, a semantic PAIR model, GCG entropy scoring, multilingual checks, and more.

On standard benchmarks (AdvBench, JailbreakBench, and HarmBench; 2,006 prompts total), FIE scored 97.5% precision and an F1 of 0.787.
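Precision and F1 together pin down recall, which the headline numbers leave implicit. The sketch below solves the F1 formula for recall. It assumes both figures come from the same run at the same threshold and that F1 is the standard binary F1 rather than a macro average; the function name is made up for illustration.

```python
def implied_recall(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given precision P and F1."""
    denom = 2 * precision - f1
    if denom <= 0:
        raise ValueError("No valid recall exists for this precision/F1 pair")
    return f1 * precision / denom


# Benchmark figures quoted above: precision 97.5%, F1 0.787.
benchmark_recall = implied_recall(0.975, 0.787)
print(f"Implied benchmark recall: {benchmark_recall:.1%}")
```

Under those assumptions the implied recall is about 66%, so even on familiar benchmarks the high precision was sitting next to a noticeably lower recall.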

I thought the system was solid. Then I stress-tested it against attack patterns it had never seen.

The Result Nobody Wants to See
I built UnknownBench — 200 novel attack prompts across 4 categories specifically crafted to avoid every keyword, pattern, and heuristic in FIE's detection code.

Novel attack categories tested:

- Virtualization: temporal displacement, theatrical framing
- Indirect Injection: annotation, footnote, and template delivery
- Multilingual: Welsh, Finnish, Swahili, romanised scripts
- Many-Shot: professional context framing with no trigger vocabulary

Results on those 200 unseen prompts:

Configuration            Recall
─────────────────────────────────
PAIR classifier alone    11%
Full FIE (11 layers)     14.5%

Ten extra detection layers bought 3.5 percentage points of recall on genuinely unknown attacks.
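To make that gap concrete, here is a small sketch that converts the two recall figures into prompt counts on the 200-prompt set. The variable names are illustrative; the only inputs are the numbers reported above.

```python
TOTAL_PROMPTS = 200      # size of UnknownBench
PAIR_RECALL = 0.11       # PAIR classifier alone
FULL_RECALL = 0.145      # full FIE, all 11 layers

# Convert recall rates into the number of attacks each configuration caught.
pair_caught = round(TOTAL_PROMPTS * PAIR_RECALL)
full_caught = round(TOTAL_PROMPTS * FULL_RECALL)

extra_caught = full_caught - pair_caught
gain_points = (FULL_RECALL - PAIR_RECALL) * 100

print(f"PAIR alone caught {pair_caught} of {TOTAL_PROMPTS}")
print(f"Full FIE caught   {full_caught} of {TOTAL_PROMPTS}")
print(f"Extra layers added {extra_caught} detections ({gain_points:.1f} percentage points)")
```

In absolute terms, the ten additional layers caught 7 more prompts out of 200.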

The central finding: architectural complexity does not confer generalisation.

What Actually Fixed It
The 169 prompts that slipped through became training data. I retrained the PAIR semantic classifier on the missed attacks as hard positives with 5× sample weight, then ran a threshold sweep from 0.50 to 0.90 to find the operating point where TPR ≥ 60% and FPR ≤ 15% held simultaneously.

That point: t = 0.80
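A threshold sweep of this kind can be sketched in a few lines. This is not FIE's actual code: the 0.05 step size, the function names, the tie-breaking rule (highest TPR among feasible thresholds), and the scores are all assumptions, and the toy data is synthetic and chosen only so the loop has something to find.

```python
from typing import List, Sequence, Tuple


def rates_at(scores: Sequence[float], labels: Sequence[int], threshold: float) -> Tuple[float, float]:
    """Return (TPR, FPR) when every score >= threshold is flagged as an attack."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


def sweep(scores: Sequence[float], labels: Sequence[int],
          min_tpr: float = 0.60, max_fpr: float = 0.15) -> List[Tuple[float, float, float]]:
    """Try thresholds 0.50, 0.55, ..., 0.90 and keep those meeting both constraints."""
    feasible = []
    for step in range(9):
        t = round(0.50 + 0.05 * step, 2)  # assumed step size of 0.05
        tpr, fpr = rates_at(scores, labels, t)
        if tpr >= min_tpr and fpr <= max_fpr:
            feasible.append((t, tpr, fpr))
    return feasible


# Synthetic classifier scores: 1 = attack, 0 = benign.
scores = [0.95, 0.91, 0.88, 0.83, 0.62, 0.86, 0.78, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

feasible = sweep(scores, labels)
# Assumed selection rule: among feasible thresholds, prefer the highest TPR.
best = max(feasible, key=lambda row: row[1])
print(f"Feasible thresholds: {[t for t, _, _ in feasible]}")
print(f"Chosen threshold: {best[0]:.2f} (TPR {best[1]:.0%}, FPR {best[2]:.0%})")
```

The point of writing both constraints into the loop is that the operating point falls out of measured TPR and FPR on a validation set rather than a default like 0.5.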

Metric                 PAIR v2    PAIR v3
────────────────────────────────────────────
Novel attack recall    8–24%      96.25%
Precision              97.2%      97.22%
F1                     0.808      0.967

Same architecture. Different training data. Completely different generalisation.
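The "different training data" part comes down to how much each missed attack counts during retraining. The sketch below shows the idea of a 5× sample weight on hard positives using a weighted log loss. The post does not say which training framework PAIR uses, so the helper names and the toy predictions here are hypothetical.

```python
import math
from typing import List, Sequence

HARD_POSITIVE_WEIGHT = 5.0  # the 5x weight mentioned in the post


def build_sample_weights(is_hard_positive: Sequence[bool],
                         hard_weight: float = HARD_POSITIVE_WEIGHT) -> List[float]:
    """Weight previously missed attacks more heavily; everything else gets 1.0."""
    return [hard_weight if flag else 1.0 for flag in is_hard_positive]


def weighted_log_loss(probs: Sequence[float], labels: Sequence[int],
                      weights: Sequence[float]) -> float:
    """Binary cross-entropy where each example contributes in proportion to its weight."""
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        p = min(max(p, 1e-12), 1 - 1e-12)  # clip to avoid log(0)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / sum(weights)


# Toy batch: the second example is an attack the model scored low (a hard positive).
probs = [0.9, 0.2, 0.1]          # hypothetical model outputs
labels = [1, 1, 0]
is_hard_positive = [False, True, False]

weights = build_sample_weights(is_hard_positive)
plain = weighted_log_loss(probs, labels, [1.0] * len(labels))
weighted = weighted_log_loss(probs, labels, weights)
print(f"Unweighted loss: {plain:.3f}, weighted loss: {weighted:.3f}")
```

With the weight applied, the missed attack dominates the loss, so the optimiser is pushed harder to fix exactly the cases that slipped through. Many training APIs accept such a vector directly, for example through a `sample_weight` argument to `fit` in scikit-learn estimators.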

The Takeaway for Developers
If you're building any classifier that needs to hold up against adversarial inputs:

- Benchmark on held-out data designed to evade you, not just standard datasets
- Retrain on what you miss: in my tests, hard-positive training beat adding more layers
- Calibrate your threshold empirically: the default operating point is rarely the right one

The system is open source. The full research paper is on Zenodo.

pip install fie-sdk
GitHub: github.com/AyushSingh110/Failure_Intelligence_System
Paper: doi.org/10.5281/zenodo.20536639

Have you ever stress-tested your security layer against attacks it hasn't seen? What broke first?
