I've been building GhostLM — a decoder-only transformer trained from scratch in PyTorch on cybersecurity data. No pretrained weights, every component hand-written. A few weeks ago I published a post about fixing a leaky train/val split that made my model look 14x better than it was.
I just found a second bug. This one made the benchmark itself look meaningful.
The 36.9% that wasn't
After months of training, I had a chat-tuned model hitting 36.9% on CTIBench MCQ — a 2,500-question cybersecurity multiple-choice benchmark. That's above random (25%), and I was running ablations trying to reproduce and beat it.
Then I checked the gold label distribution.
gold=A: 374 (15.0%)
gold=B: 813 (32.5%)
gold=C: 928 (37.1%) ← skewed
gold=D: 385 (15.4%)
A model that always picks C scores 37.1%. Higher than my "best" result.
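Checking for this takes a few lines. A minimal sketch, assuming a JSONL file where each record carries its answer letter in a `gold` field (the file name and field name are my placeholders, not CTIBench's actual schema):

```python
import json
from collections import Counter

# Count gold answer letters in an MCQ benchmark file.
# Assumes one JSON object per line with a "gold" field ("A".."D").
with open("ctibench_mcq.jsonl") as f:
    golds = [json.loads(line)["gold"] for line in f]

counts = Counter(golds)
total = len(golds)
for letter in "ABCD":
    n = counts[letter]
    print(f"gold={letter}: {n} ({100 * n / total:.1f}%)")

# The most common class frequency is the score of a constant-letter baseline.
letter, n = counts.most_common(1)[0]
print(f"always-{letter} baseline: {100 * n / total:.1f}%")
```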
I ran a quick diagnostic on the canonical checkpoint. Per-letter accuracy:
gold=A: 7/374 (1.9%)
gold=B: 0/813 (0.0%)
gold=C: 915/928 (98.6%)
gold=D: 0/385 (0.0%)
The model wasn't doing cybersecurity reasoning. It was emitting C on nearly every question and scoring off the label distribution.
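The diagnostic itself is trivial once you think to run it. A sketch, assuming parallel lists of gold letters and predicted letters:

```python
from collections import Counter

def per_letter_accuracy(golds, preds):
    """Accuracy broken down by gold letter. A healthy model has roughly
    uniform rows; a constant-letter model shows one row near 100% and
    the rest near 0%."""
    totals, hits = Counter(golds), Counter()
    for g, p in zip(golds, preds):
        hits[g] += (g == p)
    for letter in "ABCD":
        n, k = totals[letter], hits[letter]
        print(f"gold={letter}: {k}/{n} ({100 * k / max(n, 1):.1f}%)")
```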
The permutation test
The correct eval is multi-permutation text scoring. Instead of scoring log P("C" | prompt), score log P(option_text | prompt) under N different option orderings. Because the text behind each letter changes across orderings, a pure C-emitter collapses to random (25%) under this metric, regardless of the label distribution.
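Here's a minimal sketch of the protocol. The `model` and `tokenizer` interfaces, the prompt format, and the choice of cyclic rotations are all my assumptions, not GhostLM's actual API — rotations are just a cheap way to put every option in every position:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def option_logprob(model, tokenizer, prompt, option_text):
    """Sum of log P(option tokens | prompt) under a decoder-only LM.
    Assumed interfaces: tokenizer.encode -> list[int];
    model(ids) -> logits of shape [1, seq, vocab]."""
    p_ids = tokenizer.encode(prompt)
    o_ids = tokenizer.encode(option_text)
    ids = torch.tensor([p_ids + o_ids])
    logprobs = F.log_softmax(model(ids), dim=-1)
    # logits at position t predict the token at position t+1
    return sum(logprobs[0, len(p_ids) + i - 1, tok].item()
               for i, tok in enumerate(o_ids))

def eval_question(model, tokenizer, stem, options, gold_idx, n_perms=4):
    """Score option *text* under n_perms cyclic rotations of the options.
    Returns one hit/miss per rotation: mean(hits) -> debiased accuracy,
    all(hits) -> the strict all-perm-correct criterion."""
    n = len(options)
    rotations = [[(j + k) % n for j in range(n)] for k in range(n_perms)]
    hits = []
    for order in rotations:
        texts = [options[i] for i in order]
        prompt = (stem + "\n"
                  + "\n".join(f"{chr(65 + j)}. {t}" for j, t in enumerate(texts))
                  + "\nAnswer: ")
        scores = [option_logprob(model, tokenizer, prompt, t) for t in texts]
        best = max(range(n), key=scores.__getitem__)
        # A constant-position picker selects a *different* underlying
        # option in each rotation, so it averages out to chance.
        hits.append(order[best] == gold_idx)
    return hits
```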
I ran 4 option permutations × 2,500 questions across every checkpoint:
Model                single-order   debiased   all-perm correct
chat-v3 "canonical"  36.9%          30.5%      0/2500
chat-v3 repro2       31.2%          31.7%      3/2500
chat-v06             29.8%          31.2%      0/2500
The model I'd dismissed as a "failed reproduction" (31.2% single-order) answered 3 questions correctly under all permutations. The "canonical" answered zero. The single-order ranking is inverted relative to real capability.
I also found that the bias letter depends on the base checkpoint, not the SFT recipe. v0.4 base learned to emit C. v0.6 base learned to emit B. v0.6 hybrid learned to emit A and scored 37.1% under a different permutation — same trick, different letter.
What the models actually know
After switching to text scoring, every checkpoint clustered at 27-29% on the full 2,500-question bench, roughly 2 to 4 points above random. I also built a 50-question free-form fact recall set with substring grading:
v0.4 chat: 0/50
v0.7 chat: 1/50 (spurious — "Injection" appeared in tangent prose)
v0.9 chat: 1/50 (spurious — echoed "SHA-256" from the question)
The models have absorbed the register and vocabulary of cybersecurity writing. They do not know the facts. EternalBlue gets a wrong CVE. MITRE technique IDs get conflated. The model is a cybersec parrot.
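The two spurious hits above are exactly what boundary matching and disqualifier phrases are meant to catch. A sketch of that style of grader — my reconstruction of the idea, not the literal eval_fact_recall_v2.py code:

```python
import re

def grade_recall(answer, expected, disqualifiers=()):
    """Substring grading hardened two ways: word-boundary matching, so
    the expected string can't match as a fragment of a longer token,
    and disqualifier phrases, so an answer that merely echoes the
    question (e.g. repeating "SHA-256" back) is voided."""
    lowered = answer.lower()
    if any(d.lower() in lowered for d in disqualifiers):
        return False
    pattern = r"\b" + re.escape(expected) + r"\b"
    return re.search(pattern, answer, flags=re.IGNORECASE) is not None
```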
Contamination audit
After discovering v0.9 was trained on the PRIMUS corpus, which overlaps with CTIBench sources, I ran an 8-gram shingle overlap check. 11% of CTIBench questions share at least one 8-gram with the training corpus. The contaminated questions score 2.2 percentage points lower than clean ones: contamination is confusing the model, not helping it. This rules out the leakage-helps hypothesis and points to register shift as the explanation.
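The shingle check itself is simple. A sketch, assuming whitespace tokenization and lowercasing; the real script may normalize differently:

```python
def shingles(text, n=8):
    """Set of n-gram word shingles, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(question_text, corpus_shingles, n=8):
    """Flag a question if any of its 8-gram shingles appears verbatim
    in the training corpus shingle set."""
    return not shingles(question_text, n).isdisjoint(corpus_shingles)

# corpus_shingles is built once by unioning shingles() over every
# training document (for a large corpus, hash the shingles or use a
# Bloom filter instead of a raw set).
```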
What I built while diagnosing this
The diagnosis forced me to build better infrastructure:
eval_text_scoring.py — multi-permutation text scoring, works on any MCQ JSONL
eval_fact_recall_v2.py — free-form fact recall with boundary matching and disqualifier phrases
audit_ctibench_contamination.py — shingle overlap checker
GhostBench — a packaged eval suite with Wilson 95% CIs, McNemar's tests, forest plots, scaling-law projections. Can evaluate any small open LM, not just GhostLM.
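The Wilson interval GhostBench reports is a one-function affair. A minimal sketch; the counts in the comment are illustrative, not GhostBench output:

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score interval for k successes out of n trials
    (z=1.96 gives a 95% interval)."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# e.g. 768/2500 correct -> roughly (0.289, 0.326): a "30% ceiling"
# with honest error bars.
```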
I also published the fact-recall benchmark as a HuggingFace dataset: Ghostgim/cybersec-fact-recall.
The actual bottleneck
The RAG diagnostic told the story clearly:
Metric               Score
Retrieval@4 (no LM)  41/100
v0.9 bare            1/100
v0.9 + RAG           0/100
The retriever surfaces the right passage 41% of the time. The 81M model extracts the fact 1% of the time. Adding retrieved context destabilizes the model into repetition loops. The bottleneck is generation capacity, not retrieval. Parameter scaling is the answer.
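For clarity, the Retrieval@4 row is LM-free: it only asks whether the gold passage lands in the top 4 retrieved. A sketch, where `retriever.search` and the record fields are assumed interfaces, not my actual harness:

```python
def retrieval_at_k(questions, retriever, k=4):
    """Fraction of questions whose gold passage id appears among the
    top-k retrieved passage ids. No language model involved."""
    hits = 0
    for q in questions:
        top_ids = retriever.search(q["question"], k=k)
        hits += q["gold_passage_id"] in top_ids
    return hits / len(questions)
```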
SmolLM2-360M and Phi-3.5-mini both show factual recall emerging around 300-400M params. Ghost-base (~360M) is the next rung, gated on GPU compute.
What I actually learned
Benchmark quality compounds more than model quality at small scale. Fixing the eval methodology revealed I'd been chasing a number that was measuring positional bias, not capability. Months of ablations were optimizing the wrong objective.
The correct eval protocol for MCQ at small scale: text scoring + multi-permutation debiasing. Single-order letter scoring is uninterpretable on any benchmark with a skewed gold-label distribution.
The ~30% real ceiling across every architecture, BPE vocabulary, SFT objective, and corpus density I tried is the clearest result I have. It's reproducible, it's controlled, and it tells me exactly what the next experiment needs to be.
GitHub: https://github.com/joemunene-by/GhostLM
Eval harness: scripts/eval_text_scoring.py
Fact-recall dataset: huggingface.co/datasets/Ghostgim/cybersec-fact-recall