What I Learned Building a Cybersecurity LLM From Scratch: 30,000 Steps, 3 Data Rewrites, and One Capacity Ceiling
I started building GhostLM thinking the hard part would be the transformer architecture. I was wrong. The architecture took one day. The real work — data quality, evaluation design, finding capacity limits — took three weeks and four complete training runs to get right.
This is the honest story of that process.
What GhostLM Is
GhostLM is an open-source, decoder-only transformer language model built entirely from scratch in PyTorch, specialized for cybersecurity reasoning. No pretrained weights, no AutoModel.from_pretrained. Every component — attention, positional encoding, training loop, data pipeline — written by hand.
The current canonical model is ghost-tiny: 14.7M parameters, 2 layers, trained on CVE vulnerability descriptions, MITRE ATT&CK, CTFtime writeups, CAPEC attack patterns, and arXiv security papers.
GitHub: https://github.com/joemunene-by/GhostLM
Phase 1: The Architecture Was the Easy Part
The transformer itself came together in a single session. Causal self-attention with manual scaled dot-product, pre-norm blocks, weight-tied output projection, cosine LR schedule with linear warmup. 10/10 unit tests passing on day one.
def forward(self, x):
    B, T, C = x.size()  # batch, sequence length, embedding dim
    # Single projection to Q, K, V, then split into heads
    qkv = self.c_qkv(x)
    q, k, v = qkv.split(self.n_heads * self.head_dim, dim=-1)
    q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
    k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
    v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
    # Scaled dot-product attention with a causal mask
    att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(self.head_dim))
    att = att.masked_fill(self.causal_mask[:, :, :T, :T] == 0, float("-inf"))
    att = F.softmax(att, dim=-1)
    # Weight the values, merge heads back together, and project out
    y = self.attn_dropout(att) @ v
    y = y.transpose(1, 2).contiguous().view(B, T, C)
    return self.resid_dropout(self.proj(y))
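For reference, the cosine schedule with linear warmup has roughly this shape. This is a minimal sketch rather than GhostLM's exact implementation; the warmup length, step budget, and learning-rate bounds here are illustrative values, not the repo's.

import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=500, max_steps=30_000):
    # Linear warmup to max_lr, then cosine decay down to min_lr
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)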
First training run: 500 steps on a ThinkPad Yoga 11e with 4GB RAM. Loss dropped from 10.04 to 6.27. The model was already producing CVE-style fragments: "allows remote attackers to cause a denial of service via arbitrary SQL commands."
That felt like success. It wasn't — not yet.
Phase 1 Problem: The Train/Val Split Was Leaking
The first serious mistake: I split training and validation data by random shuffle. That sounds fine until you realize CVE descriptions are highly repetitive. The same vulnerability pattern appears in dozens of CVEs with slight wording variations.
A random split means nearly identical records end up in both train and val. The model memorizes patterns it's already seen in training, val loss looks great, and you think you're making progress. You're not — you're measuring memorization.
Phase 1 val_loss: 2.74 at 10K steps. Looked great. Was fake.
The fix was content-hash bucketing:
import hashlib

# Same text always lands in the same split, deterministically
bucket = int(hashlib.sha256(text.encode()).hexdigest(), 16) % 100
split = "val" if bucket < val_pct else "train"
After fixing the split, val_loss jumped to 3.78 on the same architecture and training recipe. That jump wasn't the model getting worse — it was the eval getting honest.
Lesson 1: Always audit your train/val split. Leakage is silent and optimistic.
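A cheap way to run that audit is to check how many validation records share a normalized prefix with something in train. A rough sketch, assuming the splits are plain lists of strings and using an arbitrary 200-character prefix:

import hashlib

def overlap_rate(train_texts, val_texts, prefix_len=200):
    # Fraction of val records whose normalized prefix also appears in train
    def key(text):
        norm = " ".join(text.lower().split())[:prefix_len]
        return hashlib.sha256(norm.encode()).hexdigest()
    train_keys = {key(t) for t in train_texts}
    hits = sum(1 for t in val_texts if key(t) in train_keys)
    return hits / max(1, len(val_texts))

Anything much above zero means the val numbers are partly measuring memorization.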
Phase 2 Problem: NVD Was 87% of the Corpus
With a clean split, Phase 2 trained to 10K steps on what I thought was a diverse cybersecurity corpus. 10,925 records. NVD CVEs, synthetic CTF writeups, synthetic papers.
What I hadn't measured: token share. NVD CVE descriptions are short (avg ~200 chars) and there are thousands of them. The synthetic writeups are long (~800 chars each) but there were only 500.
Running scripts/data_stats.py revealed the actual situation:
NVD CVE: 87.3% of tokens
Synthetic: 7.8%
Papers: 4.9%
The model wasn't learning "cybersecurity language." It was learning "CVE description language" — one very specific register of one very specific document type.
Every prompt, regardless of domain, produced CVE-style output:
- CTF prompt → fake CVE
- MITRE ATT&CK prompt → fake CVE
- Research paper prompt → fake CVE
The model had one output register and applied it to everything.
Lesson 2: Measure token share, not record count. A dataset with 6 sources and 87% from one source has 1 effective source.
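The check itself is only a few lines. This is a sketch of the idea rather than the actual scripts/data_stats.py; it assumes each record is a dict with "source" and "text" fields and a tokenizer exposing encode():

from collections import Counter

def token_share(records, tokenizer):
    # Tokens per source, as a fraction of the whole corpus, largest first
    counts = Counter()
    for rec in records:
        counts[rec["source"]] += len(tokenizer.encode(rec["text"]))
    total = sum(counts.values())
    return {src: n / total for src, n in counts.most_common()}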
Phase 3: Real Data From Real Sources
Phase 3 added three new real data sources:
MITRE ATT&CK — pulled via STIX2 JSON from GitHub. 691 technique descriptions covering adversary tactics, techniques, and procedures.
CAPEC — XML pull from MITRE covering attack patterns. 609 records with structured attack methodology descriptions.
CTFtime real writeups — 467 inline writeups from actual CTF events, attributed and licensed for research use.
These three sources brought the token distribution to:
NVD CVE: 65.3% (capped at 6M tokens via content-hash subsample)
Synthetic CTF: 17.2%
arXiv cs.CR: 8.4%
CTFtime real: 5.3%
MITRE ATT&CK: 2.9%
CAPEC: 0.9%
The NVD cap uses deterministic content-hash subsampling — same 71,828-record prefix every rebuild, so train/val splits stay byte-identical across runs:
python3 scripts/rebuild_corpus.py --max-cve-tokens 6000000
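Sketched, the subsampling idea looks like this (not the exact rebuild_corpus.py logic): order records by their content hash and keep a prefix of that ordering until the token budget is spent, so the kept set never depends on input order or rebuild timing.

import hashlib

def subsample_to_budget(records, tokenizer, max_tokens):
    # Deterministic: same corpus in -> same kept prefix out, every rebuild
    ordered = sorted(records, key=lambda r: hashlib.sha256(r["text"].encode()).hexdigest())
    kept, used = [], 0
    for rec in ordered:
        n = len(tokenizer.encode(rec["text"]))
        if used + n > max_tokens:
            break
        kept.append(rec)
        used += n
    return kept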
I trained ghost-tiny for 30,000 steps on the rebalanced corpus. The per-source perplexity results were striking:
| Source | PPL before (v0.3.3) | PPL after (v0.3.5) | Change |
| --- | --- | --- | --- |
| MITRE ATT&CK | 615 | 55 | −91% |
| CTFtime writeups | 184 | 61 | −67% |
| CAPEC | 326 | 134 | −59% |
| Synthetic CTF | 68 | 28 | −58% |
| arXiv | 671 | 355 | −47% |
| NVD CVE | 24 | 28 | +14% |
Every diversity source improved dramatically. NVD paid a small cost (+14%) — expected, since it now shares parameter capacity with five other domains. That's the right tradeoff.
The behavioral change was visible. Same model size, same training recipe, different corpus balance:
Before (v0.3.3):
Prompt: MITRE ATT&CK technique T1003
Output: CVE-2019-XXXX: A vulnerability in [product] allows remote attackers to...
After (v0.3.5):
Prompt: MITRE ATT&CK technique T1003
Output: T1003.011: defense-evasion. Tactic: defense-evasion. Adversaries may use-evasion, such as legitimate system-evasion...
That's actual MITRE schema output — sub-technique ID format, "Tactic:" header, the standard "Adversaries may..." opening. The model learned the MITRE register because it finally had MITRE training data in meaningful quantity.
Lesson 3: Model behavior follows token distribution. Want register diversity? Give your corpus source diversity at the token level.
The Evaluation Was Broken Too
While fixing the data, I discovered the evaluation was also broken.
The security task evaluation asked the model to classify inputs into categories (CVE severity, vulnerability type, MITRE tactic, etc.) by scoring each candidate label's log-probability. Every model across every phase reported 13.3% accuracy — exactly 4/30, exactly at the random baseline.
That's not the model performing poorly. That's a mode-collapsed eval. The model was assigning its highest probability to the same single label for every sample in each task — the most common token sequence — regardless of the input.
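That failure is easy to test for directly before trusting an accuracy number: look at how the predicted labels distribute. A minimal sketch, where predicted_labels is simply the list of argmax labels across one task's samples:

from collections import Counter

def prediction_spread(predicted_labels):
    # If one label soaks up nearly all predictions, the eval is mode-collapsed
    counts = Counter(predicted_labels)
    top_label, top_n = counts.most_common(1)[0]
    return top_label, top_n / len(predicted_labels)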
The fix was PMI (Pointwise Mutual Information) scoring:
# Instead of: score = log P(label | context)
# Use: score = log P(label | context) - log P(label)
pmi_score = conditional_logprob - unconditional_logprob
This subtracts the model's prior preference for each label, so common labels don't automatically win. After switching to PMI scoring, the eval could finally discriminate between models:
| Phase | PMI accuracy | vs random (14.5%) |
| --- | --- | --- |
| Phase 1 | 20% | +5.5 pp |
| Phase 3.5 | 31.2% | +16.7 pp |
Not impressive in absolute terms — ghost-tiny at 14.7M params isn't going to nail classification tasks. But it's real signal, not evaluation noise.
Lesson 4: If every model in every phase scores the same on your eval, the eval is broken. Check for mode collapse before concluding the model isn't learning.
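For reference, PMI scoring against a decoder-only model looks roughly like this. A sketch, not GhostLM's eval code: it assumes the model returns (batch, seq, vocab) logits and a tokenizer exposing encode(), and it approximates the unconditional term with a neutral null prompt, which is my simplification rather than anything from the repo.

import torch
import torch.nn.functional as F

@torch.no_grad()
def sequence_logprob(model, tokenizer, text, device="cpu"):
    # Sum of log P(token_t | tokens_<t) over the whole text
    ids = torch.tensor([tokenizer.encode(text)], device=device)
    logits = model(ids)  # assumed shape: (batch, seq, vocab)
    logprobs = F.log_softmax(logits, dim=-1)
    target = ids[:, 1:]
    token_lp = logprobs[:, :-1, :].gather(-1, target.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()

def pmi_score(model, tokenizer, context, label, null_prompt="Answer:"):
    # log P(label | context) minus an estimate of the label's prior
    cond = (sequence_logprob(model, tokenizer, f"{context} {label}")
            - sequence_logprob(model, tokenizer, context))
    prior = (sequence_logprob(model, tokenizer, f"{null_prompt} {label}")
             - sequence_logprob(model, tokenizer, null_prompt))
    return cond - prior

The predicted class is then the argmax of pmi_score over the candidate labels.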
Phase 3.6: Finding the Capacity Ceiling
With a working eval and a clean corpus, I tried to push further. Phase 3.6 added Exploit-DB (~3.77M tokens, 30% of the new corpus), and I re-trained ghost-tiny with the same 30K-step recipe.
The results were a clean regression:
| Task | Phase 3.5 | Phase 3.6 | Change |
| --- | --- | --- | --- |
| CVE Severity | 32% | 16% | −16 pp |
| Vulnerability Type | 32% | 12% | −20 pp |
| Attack Technique | 40% | 16% | −24 pp |
| CTF Categorization | 40% | 20% | −20 pp |
| Overall | 31.2% | 16.8% | −14.4 pp |
Per-source perplexity confirmed the diagnosis: every existing source got 28–42% worse while Exploit-DB itself landed well (PPL 40.87). The model learned Exploit-DB at the expense of everything else. Parameter capacity was reallocated, not expanded.
At 14.7M parameters and 30K training steps, ghost-tiny has hit its ceiling. More data at this model size produces diminishing returns — and eventually, regression.
Lesson 5: Parameter capacity is the binding constraint. When adding data hurts existing domains, you've hit the model's capacity wall. The fix is more parameters, not more data.
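That per-source check is cheap to automate after every run. A minimal sketch, again assuming held-out records tagged with "source" and "text" and a model returning (batch, seq, vocab) logits; the repo's actual eval may differ in detail.

import math
from collections import defaultdict
import torch
import torch.nn.functional as F

@torch.no_grad()
def per_source_perplexity(model, tokenizer, val_records, device="cpu"):
    # Perplexity per source; a jump on existing sources after adding data is the warning sign
    nll, ntok = defaultdict(float), defaultdict(int)
    for rec in val_records:
        ids = torch.tensor([tokenizer.encode(rec["text"])], device=device)
        logits = model(ids)
        loss = F.cross_entropy(logits[:, :-1, :].reshape(-1, logits.size(-1)),
                               ids[:, 1:].reshape(-1), reduction="sum")
        nll[rec["source"]] += loss.item()
        ntok[rec["source"]] += ids.size(1) - 1
    return {src: math.exp(nll[src] / ntok[src]) for src in nll}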
Where GhostLM Is Now
v0.3.5 (current canonical):
- 30,000 training steps
- 74,635 records, 8.8M tokens across 6 sources
- Cyber-text perplexity: 96.24 (vs GPT-2's 26.76 — ~8× less capacity, ~3.6× behind)
- PMI security task accuracy: 31.2% (vs 14.5% random)
- Register diversity: switches between CVE, MITRE, and CTF voice depending on prompt
What's next:
- ghost-small (55M params) — the first scale-up rung, targeting M4 GPU/MPS
- Corpus volume expansion: CTFtime archives, security research blogs, full-text papers
- Applied for Google TPU Research Cloud credits
The honest framing: ghost-tiny is a learning artifact and a working pipeline. It is not a useful cybersecurity AI tool. ghost-small is where domain-coherent generation might start to emerge. ghost-base (~350M) is where it gets genuinely useful. That's the realistic roadmap.
What I'd Tell Someone Starting This
1. Build the data pipeline before the model. The architecture is a solved problem — transformers are well understood. What's not solved for your domain is the data. Start there.
2. Measure token share, not record count. Diversity in records doesn't mean diversity in training signal if one source dominates token count.
3. Fix your evaluation before trusting it. Mode-collapsed evals are optimistic and useless. PMI scoring, per-source perplexity, and fixed external test sets are more honest than aggregate val_loss.
4. Set a capacity budget and test it. Before adding more data, test whether your model can absorb it without regressing on existing domains. A simple per-source perplexity check after each training run tells you whether you're within capacity.
5. Be honest about what your model can't do. A 14.7M parameter model hallucinating at temperature 0.7 is not a cybersecurity tool. Label it accurately. The trajectory toward useful matters more than the current absolute performance.
Try It Yourself
git clone https://github.com/joemunene-by/GhostLM.git
cd GhostLM
make install
make data
make train-tiny
make chat
The codebase is designed to be readable — every component is hand-written with docstrings, and the architecture is clean enough to use as a reference for how a transformer actually fits together.
Contributions welcome — especially around corpus expansion, evaluation design, and architecture improvements (RoPE and Flash Attention are already in, SwiGLU and grouped query attention are next candidates).
GitHub: https://github.com/joemunene-by/GhostLM
License: MIT
Built in Nairobi, Kenya