I've been building GhostLM — a decoder-only transformer trained from scratch in PyTorch on cybersecurity data. No pretrained weights, every component hand-written. A few weeks ago I published a post about fixing a leaky train/val split that made my model look 14x better than it was.
I just found a second bug. This one made the benchmark itself look meaningful.
The 36.9% that wasn't
After months of training, I had a chat-tuned model hitting 36.9% on CTIBench MCQ — a 2,500-question cybersecurity multiple-choice benchmark. That's above random (25%), and I was running ablations trying to reproduce and beat it.
Then I checked the gold label distribution.
gold=A: 374 (15.0%)
gold=B: 813 (32.5%)
gold=C: 928 (37.1%) ← skewed
gold=D: 385 (15.4%)
A model that always picks C scores 37.1%. Higher than my "best" result.
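Checking for this takes a few lines. A minimal sketch, assuming a JSONL file where each record carries its answer letter in a `gold` field (the file name and field name are my placeholders, not CTIBench's actual schema):

```python
import json
from collections import Counter

# Count gold answer letters in an MCQ benchmark file.
# Assumes one JSON object per line with a "gold" field ("A".."D").
with open("ctibench_mcq.jsonl") as f:
    golds = [json.loads(line)["gold"] for line in f]

counts = Counter(golds)
total = len(golds)
for letter in "ABCD":
    n = counts[letter]
    print(f"gold={letter}: {n} ({100 * n / total:.1f}%)")

# The most common class frequency is the score of a constant-letter baseline.
letter, n = counts.most_common(1)[0]
print(f"always-{letter} baseline: {100 * n / total:.1f}%")
```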
I ran a quick diagnostic on the canonical checkpoint. Per-letter accuracy:
gold=A: 7/374 (1.9%)
gold=B: 0/813 (0.0%)
gold=C: 915/928 (98.6%)
gold=D: 0/385 (0.0%)
The model wasn't doing cybersecurity reasoning. It was emitting C on nearly every question and scoring off the label distribution.
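The diagnostic itself is trivial once you think to run it. A sketch, assuming parallel lists of gold letters and predicted letters:

```python
from collections import Counter

def per_letter_accuracy(golds, preds):
    """Accuracy broken down by gold letter. A healthy model has roughly
    uniform rows; a constant-letter model shows one row near 100% and
    the rest near 0%."""
    totals, hits = Counter(golds), Counter()
    for g, p in zip(golds, preds):
        hits[g] += (g == p)
    for letter in "ABCD":
        n, k = totals[letter], hits[letter]
        print(f"gold={letter}: {k}/{n} ({100 * k / max(n, 1):.1f}%)")
```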
The permutation test
The correct eval is multi-permutation text scoring. Instead of scoring log P("C" | prompt), score log P(option_text | prompt) under N different option orderings. Because the text behind each letter changes across orderings, a pure C-emitter collapses to random (25%) under this metric, regardless of the label distribution.
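Here's a minimal sketch of the protocol. The `model` and `tokenizer` interfaces, the prompt format, and the choice of cyclic rotations are all my assumptions, not GhostLM's actual API — rotations are just a cheap way to put every option in every position:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def option_logprob(model, tokenizer, prompt, option_text):
    """Sum of log P(option tokens | prompt) under a decoder-only LM.
    Assumed interfaces: tokenizer.encode -> list[int];
    model(ids) -> logits of shape [1, seq, vocab]."""
    p_ids = tokenizer.encode(prompt)
    o_ids = tokenizer.encode(option_text)
    ids = torch.tensor([p_ids + o_ids])
    logprobs = F.log_softmax(model(ids), dim=-1)
    # logits at position t predict the token at position t+1
    return sum(logprobs[0, len(p_ids) + i - 1, tok].item()
               for i, tok in enumerate(o_ids))

def eval_question(model, tokenizer, stem, options, gold_idx, n_perms=4):
    """Score option *text* under n_perms cyclic rotations of the options.
    Returns one hit/miss per rotation: mean(hits) -> debiased accuracy,
    all(hits) -> the strict all-perm-correct criterion."""
    n = len(options)
    rotations = [[(j + k) % n for j in range(n)] for k in range(n_perms)]
    hits = []
    for order in rotations:
        texts = [options[i] for i in order]
        prompt = (stem + "\n"
                  + "\n".join(f"{chr(65 + j)}. {t}" for j, t in enumerate(texts))
                  + "\nAnswer: ")
        scores = [option_logprob(model, tokenizer, prompt, t) for t in texts]
        best = max(range(n), key=scores.__getitem__)
        # A constant-position picker selects a *different* underlying
        # option in each rotation, so it averages out to chance.
        hits.append(order[best] == gold_idx)
    return hits
```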
I ran 4 option permutations × 2,500 questions across every checkpoint:
Model                single-order   debiased   all-perm correct
chat-v3 "canonical"  36.9%          30.5%      0/2500
chat-v3 repro2       31.2%          31.7%      3/2500
chat-v06             29.8%          31.2%      0/2500
The model I'd dismissed as a "failed reproduction" (31.2% single-order) answered 3 questions correctly under all permutations. The "canonical" answered zero. The single-order ranking is inverted relative to real capability.
I also found that the bias letter depends on the base checkpoint, not the SFT recipe. v0.4 base learned to emit C. v0.6 base learned to emit B. v0.6 hybrid learned to emit A and scored 37.1% under a different permutation — same trick, different letter.
What the models actually know
After switching to text scoring, every checkpoint clustered at 27-29% on the full 2,500-question bench, roughly 2 to 4 points above random. I also built a 50-question free-form fact recall set with substring grading:
v0.4 chat: 0/50
v0.7 chat: 1/50 (spurious — "Injection" appeared in tangent prose)
v0.9 chat: 1/50 (spurious — echoed "SHA-256" from the question)
The models have absorbed the register and vocabulary of cybersecurity writing. They do not know the facts. EternalBlue gets a wrong CVE. MITRE technique IDs get conflated. The model is a cybersec parrot.
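The two spurious hits above are exactly what boundary matching and disqualifier phrases are meant to catch. A sketch of that style of grader — my reconstruction of the idea, not the literal eval_fact_recall_v2.py code:

```python
import re

def grade_recall(answer, expected, disqualifiers=()):
    """Substring grading hardened two ways: word-boundary matching, so
    the expected string can't match as a fragment of a longer token,
    and disqualifier phrases, so an answer that merely echoes the
    question (e.g. repeating "SHA-256" back) is voided."""
    lowered = answer.lower()
    if any(d.lower() in lowered for d in disqualifiers):
        return False
    pattern = r"\b" + re.escape(expected) + r"\b"
    return re.search(pattern, answer, flags=re.IGNORECASE) is not None
```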
Contamination audit
After discovering v0.9 was trained on the PRIMUS corpus, which overlaps with CTIBench sources, I ran an 8-gram shingle overlap check. 11% of CTIBench questions share at least one 8-gram with the training corpus. The contaminated questions score 2.2 percentage points lower than clean ones: contamination is confusing the model, not helping it. This rules out the leakage-helps hypothesis and points to register shift as the explanation.
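The shingle check itself is simple. A sketch, assuming whitespace tokenization and lowercasing; the real script may normalize differently:

```python
def shingles(text, n=8):
    """Set of n-gram word shingles, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(question_text, corpus_shingles, n=8):
    """Flag a question if any of its 8-gram shingles appears verbatim
    in the training corpus shingle set."""
    return not shingles(question_text, n).isdisjoint(corpus_shingles)

# corpus_shingles is built once by unioning shingles() over every
# training document (for a large corpus, hash the shingles or use a
# Bloom filter instead of a raw set).
```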
What I built while diagnosing this
The diagnosis forced me to build better infrastructure:
eval_text_scoring.py — multi-permutation text scoring, works on any MCQ JSONL
eval_fact_recall_v2.py — free-form fact recall with boundary matching and disqualifier phrases
audit_ctibench_contamination.py — shingle overlap checker
GhostBench — a packaged eval suite with Wilson 95% CIs, McNemar's tests, forest plots, scaling-law projections. Can evaluate any small open LM, not just GhostLM.
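The Wilson interval GhostBench reports is a one-function affair. A minimal sketch; the counts in the comment are illustrative, not GhostBench output:

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score interval for k successes out of n trials
    (z=1.96 gives a 95% interval)."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# e.g. 768/2500 correct -> roughly (0.289, 0.326): a "30% ceiling"
# with honest error bars.
```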
I also published the fact-recall benchmark as a HuggingFace dataset: Ghostgim/cybersec-fact-recall.
The actual bottleneck
The RAG diagnostic told the story clearly:
Metric               Score
Retrieval@4 (no LM)  41/100
v0.9 bare            1/100
v0.9 + RAG           0/100
The retriever surfaces the right passage 41% of the time. The 81M model extracts the fact 1% of the time. Adding retrieved context destabilizes the model into repetition loops. The bottleneck is generation capacity, not retrieval. Parameter scaling is the answer.
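For clarity, the Retrieval@4 row is LM-free: it only asks whether the gold passage lands in the top 4 retrieved. A sketch, where `retriever.search` and the record fields are assumed interfaces, not my actual harness:

```python
def retrieval_at_k(questions, retriever, k=4):
    """Fraction of questions whose gold passage id appears among the
    top-k retrieved passage ids. No language model involved."""
    hits = 0
    for q in questions:
        top_ids = retriever.search(q["question"], k=k)
        hits += q["gold_passage_id"] in top_ids
    return hits / len(questions)
```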
SmolLM2-360M and Phi-3.5-mini both show factual recall emerging around 300-400M params. Ghost-base (~360M) is the next rung, gated on GPU compute.
What I actually learned
Benchmark quality compounds more than model quality at small scale. Fixing the eval methodology revealed I'd been chasing a number that was measuring positional bias, not capability. Months of ablations were optimizing the wrong objective.
The correct eval protocol for MCQ at small scale: text scoring + multi-permutation debiasing. Single-order letter scoring is uninterpretable on any benchmark with a skewed gold-label distribution.
The ~30% real ceiling across every architecture, BPE vocabulary, SFT objective, and corpus density I tried is the clearest result I have. It's reproducible, it's controlled, and it tells me exactly what the next experiment needs to be.
GitHub: https://github.com/joemunene-by/GhostLM
Eval harness: scripts/eval_text_scoring.py
Fact-recall dataset: huggingface.co/datasets/Ghostgim/cybersec-fact-recall