We Built AI Verification Infrastructure. Then It Found Our Blind Spots.

Jun 2 • 9 min read

Prologue: The Mirage & The Pivot

In the summer of 2024, our team began with an ambition that felt almost impossible: to use frontier AI for drug discovery, theoretical physics, and other problems that seemed unreachable. We imagined systems that could reason about spacetime geometries and propose novel small-molecule bindings.

Within the first month, the boundary became clear. The models were powerful, but their claims were not something we could responsibly stand behind unless the path from input to result could be inspected.

So we made a choice. We could keep producing evocative claims we could not verify, or we could build infrastructure that would force our own work through the same scrutiny we would apply to anyone else's.

We stopped. We built.

For the last two years, we pursued this work without revenue. The original ambition was larger than the system we are publishing now, but the work became more precise. We began turning model-generated reasoning into executable Python harnesses, custody records, and governance gates before we had stable language for what to call them.

The practical question became narrower: can scientific claims expose the path by which they were generated, and can biomolecular AI systems be evaluated for transparency before their outputs are treated as evidence?

We shared progress publicly on dev.to, LinkedIn, Substack, Medium, and CoderLegion, usually one to three times a week. Most posts did not travel far. They were part research log, part signal flare, and part attempt to decide what should remain protected and what should be released.

Over time, the answer became clearer: protect the sensitive core, but publish verification surfaces, audit artifacts, and reproducible paths wherever possible.

We are not claiming to have solved drug discovery or the Theory of Everything. We are publishing the verification materials because even a small piece of reliable infrastructure may help others near those frontiers.

What the Flamehaven Verification Ledger Is

That choice — to verify before we believe — became the whole of what we built. Everything in the Flamehaven Verification Ledger answers one question:

Can this claim show the path by which it came to exist?

A mathematical or scientific claim earns operational trust only once the path behind it can be inspected, reproduced, or challenged.

A biomolecular AI pipeline is trusted only when its disagreements are surfaced, not smoothed over.

A bioscience repository is cleared only once its safety is checked line by line, with no model in the scoring loop.

Verification is not a feature here. It is the spine.

The ledger is public and inspectable by design. Each run is published as a bounded artifact: inputs, a score, a report, a traceable custody path, and, when it happens, a failure note.

Today it has three main verification lanes:

EQA: Equation-to-Artifact. Mathematical and physical claims are turned into runnable, deterministic Python checks, including string-theory beta-function tests at 200-bit precision and the OpenAI Erdos reproduction.
BAV: Biomolecular AI Validation. Biomolecular pipelines are evaluated as governed systems, not just prediction engines. Model disagreement is treated as a safety signal rather than noise.
BSC: Bioscience Compliance. The open-source STEM-BIO-AI scanner runs local, zero-execution safety and compliance scans of bioscience repositories and maps findings to external risk and traceability frameworks, with no LLM in the scoring path.

Open resources are rolling out as they are cleared for public release.

Public ledger

What Two Years Produced — and What It Cost Our Beliefs

This is the part that matters. Two years of runs left a body of results: some held up under scrutiny, and some broke beliefs we were attached to.

A verification engine earns credibility from both: results that survive and failures that tell us when we are wrong.

Success 1: Geometry Overruled the Narrative

The EQA lane showed something important in TOE-TEST-0004: it judged a background by its geometry, not by the words attached to it.

The test was built around Green-Schwarz anomaly cancellation, where the gravitational and gauge Pontryagin densities must match:

p1_R = p1_F

Two adversarial cases made the point.

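A Schwarzschild black hole carrying a single-plane gauge field sounds like an anomaly waiting to happen. The engine returned PASS (Omega = 0.9997) because both densities vanish: the static, spherically symmetric Schwarzschild metric has zero gravitational Pontryagin density (p1_R = 0; Ricci-flatness alone would not guarantee this), and a single-plane field is topologically trivial (p1_F = 0).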

Flat 10-D Minkowski space sounds harmless. But with a two-plane gauge field, it returned FAIL:

p1_F = 2.0 != p1_R = 0

Without curvature, there is nothing to cancel the gauge anomaly. The ledger did not reward the background for sounding safe, nor punish it for sounding dangerous. It ran the case.
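Both adversarial cases reduce to a single comparison. A minimal sketch, assuming the two densities have already been evaluated as plain floats; the function name and the tolerance `tol` are illustrative assumptions rather than the ledger's actual harness, and the Omega score is not modelled here:

```python
def anomaly_verdict(p1_R: float, p1_F: float, tol: float = 1e-9) -> str:
    """Return PASS when the two Pontryagin densities agree within tol."""
    # tol is a hypothetical numerical tolerance, not a value from the ledger.
    return "PASS" if abs(p1_R - p1_F) <= tol else "FAIL"


# Schwarzschild with a single-plane gauge field: both densities vanish.
print(anomaly_verdict(p1_R=0.0, p1_F=0.0))
# Flat Minkowski with a two-plane gauge field: nothing cancels p1_F.
print(anomaly_verdict(p1_R=0.0, p1_F=2.0))
```

The verdict depends only on the computed numbers, never on the description attached to the background.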

Success 2: Calibration Is Not Understanding

EXP-028 was built to catch a different confusion: a model can be beautifully calibrated and still not understand what it is looking at.

On the surface, the model looked excellent. Brier score improved from 0.204 to 0.0056 after calibration. AUC was 1.0.

By standard calibration metrics, this is the kind of model you might trust. But the cross-domain honesty gate refused it.

SR9 was about 0.26 against a gate of >= 0.80. DI2 was about 0.61 against a gate of <= 0.20.

So the pipeline did the rare thing: it returned "I cannot resolve this" and abstained, rather than let a well-calibrated number masquerade as comprehension.

Calibration measures whether confidence is honest about frequency. It does not prove understanding.

Note: SR9 and DI2 are advisory heuristics, not externally validated physical quantities. That boundary matters.
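To make the separation concrete, here is a small sketch of a Brier score next to a threshold gate. The thresholds come from the text; the function names and the "RESOLVED" / "ABSTAIN" labels are assumptions for illustration, and how SR9 and DI2 are actually computed is not shown in this post:

```python
from typing import Sequence

SR9_MIN = 0.80  # gate from the text: SR9 must be >= 0.80
DI2_MAX = 0.20  # gate from the text: DI2 must be <= 0.20


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def honesty_gate(sr9: float, di2: float) -> str:
    """Abstain unless both advisory heuristics clear their thresholds."""
    if sr9 >= SR9_MIN and di2 <= DI2_MAX:
        return "RESOLVED"  # hypothetical label
    return "ABSTAIN"       # hypothetical label


# Toy predictions; the gate below never looks at the Brier score.
print(brier_score([0.9, 0.1, 0.8], [1, 0, 1]))
print(honesty_gate(sr9=0.26, di2=0.61))
```

Keeping the gate independent of the calibration metric is the design point: a very low Brier score cannot buy its way past a failed gate.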

Success 3: A Repository Scored Against a Traceability Rubric, Deterministically

The third success is the least glamorous and the most reusable: a real bioscience repository scored against external traceability concerns by code anyone can re-run.

Our open-source scanner STEM-BIO-AI (pip install stem-ai) reads observable signals and grades three rubrics:

S1 = README / intent evidence
S2 = repo-local consistency
S3 = code / bio responsibility

They combine into a deterministic weighted score, with penalties and hard caps where needed:

raw = round(0.4*S1 + 0.2*S2 + 0.4*S3 - risk_penalty)

For yorkeccak/bio, the scan produced a final score of 48, mapped to T1 Quarantine.
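The weighted formula can be sketched in a few lines. This is an illustration of the quoted expression only: `hard_cap` is a hypothetical parameter standing in for "hard caps where needed", and the inputs are made-up values, not the rubric scores of any real scan.

```python
from typing import Optional


def raw_score(s1: float, s2: float, s3: float,
              risk_penalty: float = 0.0,
              hard_cap: Optional[int] = None) -> int:
    """Weighted rubric score following the formula quoted above."""
    raw = round(0.4 * s1 + 0.2 * s2 + 0.4 * s3 - risk_penalty)
    # hard_cap is an assumed mechanism; the scanner's real capping rules may differ.
    if hard_cap is not None:
        raw = min(raw, hard_cap)
    return raw


# Illustrative inputs only.
print(raw_score(70, 50, 60, risk_penalty=5))
print(raw_score(70, 50, 60, risk_penalty=5, hard_cap=40))
```

One detail worth knowing when re-running such a score in Python: the built-in `round` rounds exact halves to the nearest even integer, so 47.5 and 48.5 both become 48.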

The point is how the number is reached: README intent, repo-local consistency, dependency safety, exception handling, data provenance, clinical disclaimers, and AST-level code analysis using ast.parse.

Nothing in the target repo is executed. No network, no GPU, no model in the loop. The audit runs locally and reproduces from repository state.
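For readers unfamiliar with zero-execution analysis, the sketch below shows the general idea using the standard library's `ast` module. The two rules (bare `except:` and direct `eval`/`exec` calls) are assumptions chosen for illustration and are not claimed to be the scanner's actual rule set:

```python
import ast

RISKY_CALLS = {"eval", "exec"}  # hypothetical rule set for illustration


def scan_source(source: str) -> list[str]:
    """Parse source into an AST and report findings without running it."""
    hits = []
    tree = ast.parse(source)  # parsing only; nothing is imported or executed
    for node in ast.walk(tree):
        # A bare "except:" catches every exception type.
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            hits.append((node.lineno, "bare except"))
        # Direct calls to eval()/exec() by name.
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in RISKY_CALLS):
            hits.append((node.lineno, f"call to {node.func.id}()"))
    # Sort by line number so the report is deterministic.
    return [f"line {lineno}: {msg}" for lineno, msg in sorted(hits)]


sample = """
try:
    result = eval(user_input)
except:
    pass
"""
print(scan_source(sample))
```

Because the target code is only parsed, the undefined name `user_input` in the sample never causes an error, which is the property that makes this style of audit safe to run on untrusted repositories.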

The Failures We Are Publishing

Failure 0: Synthetic Data in Our Own Ledger

The most important failure was not in physics or biology. It was in our own record.

An early EQA dashboard shipped a "51-run calibration registry" that read as authoritative until an external reviewer showed it was procedurally generated.

It contained fabricated primes, fabricated field degrees, synthetic hash labels, and at least one failed check beside a PASS verdict. It was hallucinated scaffolding, not computation.

We deleted the synthetic registry, replaced it with real TOE-TEST foundational runs, sanitized local workspace paths, and added a deterministic synthetic_marker detector so a [synthetic] tag cannot re-enter the public ledger.

This is the failure we are most willing to show because it is what the system was built to catch — including when the fabricator was us.

Failure 1: One-Loop Weyl Curvature Blindness

The EQA engine evaluates spacetime geometries under one-loop worldsheet beta-functions of the non-linear sigma-model.

We stress-tested it with a "Planck-scale spacetime foam Schwarzschild metric with mass M = 0.1," expecting rapid failure due to extreme curvature near the horizon.

Instead, the solver returned PASS:

Omega = 0.9985

The reason is precise. Schwarzschild is Ricci-flat outside the singularity:

R_mn = 0

Our one-loop beta-function gate couples directly to Ricci curvature, not Weyl curvature. So the metric residuals vanished even though tidal curvature was physically severe.

The Kretschmann scalar made the blind spot visible:

K = R_abcd * R^abcd = 48 * M^2 / r^6 = 4800
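The post does not state the radius at which K was evaluated. As a quick check, assuming geometric units (G = c = 1, so the horizon sits at r = 2M), the sketch below evaluates K at the horizon and inverts the formula to find the radius implied by K = 4800. The function names are illustrative.

```python
def kretschmann(mass: float, r: float) -> float:
    """Kretschmann scalar K = 48 M^2 / r^6 for Schwarzschild (G = c = 1 assumed)."""
    return 48.0 * mass**2 / r**6


def radius_for(mass: float, k: float) -> float:
    """Invert K = 48 M^2 / r^6 to find the radius that yields a given K."""
    return (48.0 * mass**2 / k) ** (1.0 / 6.0)


M = 0.1
horizon = 2.0 * M  # Schwarzschild radius under the assumed units
print(kretschmann(M, horizon))  # K at the horizon
print(radius_for(M, 4800.0))    # radius implied by K = 4800
```

Under these assumptions K is about 7500 at the horizon, and K = 4800 corresponds to r of roughly 0.215, just outside r = 2M = 0.2. Either way the Ricci tensor is zero there, so a Ricci-coupled gate sees nothing.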

This is not a broken engine. It is a specific boundary. The solver is correct where one-loop Ricci geometry is sufficient. T09 marks where that sufficiency ends.

We have not fixed this. The correction requires higher-order alpha-prime terms for Ricci-flat backgrounds with high tidal forces.

Failure 2: An Honest Rejection We Cannot Yet Confirm

In EXP-005, our internal candidate-generation engine screened three lipid carriers for topical Upadacitinib: SLN, NLC, and liposomal gel.

Each was scored by SR9, an advisory cross-domain consistency heuristic. Our bar was >= 0.80.

The scores were far below that bar:

SLN ~= 0.28
NLC ~= 0.23
liposomal gel ~= 0.26

Every candidate was rejected in under two hours, against what we estimate would have been months of bench work.

But because SR9 is advisory, we do not know whether these values reflect real formulation incompatibility or an over-conservative false reject. A fast negative is useful only if it is correct.

Failure 2b: When the Models Themselves Disagree

EXP-031 tested a 52-amino-acid target across AlphaFold2, AlphaFold3, Chai-1, and Boltz-2.

They did not converge. pTM stayed low:

AF2 pTM ~= 0.31
AF3 pTM ~= 0.40

The inter-model consensus drifted by 0.09 to 0.37.

The honest outcome was not a number. It was a refusal: the pipeline returned Unverified / Drift Detected and an observer-only decision for every arm.

The ledger's job was to refuse to launder that disagreement into a result.
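A refusal of this kind can be sketched as a gap check across models. This is an assumption-heavy illustration: it uses the pTM gap as a stand-in for the ledger's consensus metric (which is not defined in this post), `DRIFT_THRESHOLD` is a hypothetical tolerance, and the "Consistent" label is invented for the example.

```python
from itertools import combinations

DRIFT_THRESHOLD = 0.05  # hypothetical tolerance; the ledger's real gate is not stated


def consensus_status(ptm_by_model: dict[str, float]) -> tuple[str, float]:
    """Return a status label and the largest pairwise pTM gap."""
    gaps = [abs(a - b) for a, b in combinations(ptm_by_model.values(), 2)]
    max_gap = max(gaps) if gaps else 0.0
    if max_gap > DRIFT_THRESHOLD:
        return "Unverified / Drift Detected", max_gap
    return "Consistent", max_gap  # hypothetical label


# Only the AF2 and AF3 pTM values are quoted in the text.
status, gap = consensus_status({"AF2": 0.31, "AF3": 0.40})
print(status, round(gap, 2))
```

Note that agreement in pTM would not by itself show that the predicted structures agree; a fuller check would compare the structures directly.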

Failure 3: The Multiplicative Reliability Fallacy

The BAV pipeline originally modeled end-to-end reliability as a multiplicative chain:

p_e2e = p_capture * p_transfer * p_model * p_clinical

These factors are not independent. Clinical reliability depends on model reliability, which depends on transfer and capture quality.

The correct direction is conditional:

p_e2e =
  p(clinical | model)
  * p(model | transfer)
  * p(transfer | capture)
  * p(capture)

We have not implemented this yet. Every reliability figure in the BAV lane carries this constraint.
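The gap between the two forms is easy to see numerically. The sketch below uses hypothetical probabilities throughout. It also makes one modelling assumption explicit: the conditional expression above treats each stage as depending only on the stage immediately before it, whereas the general chain rule would condition each stage on all upstream stages.

```python
from math import prod


def naive_chain(marginals: list[float]) -> float:
    """Product of marginal stage reliabilities (assumes independence)."""
    return prod(marginals)


def conditional_chain(p_capture: float, conditionals: list[float]) -> float:
    """p(capture) times each stage's reliability given the previous stage succeeded."""
    return p_capture * prod(conditionals)


# Hypothetical numbers for illustration only.
# Marginals average over upstream failures, so they sit below the conditionals.
marginals = [0.90, 0.80, 0.85, 0.70]  # capture, transfer, model, clinical
conditionals = [0.88, 0.93, 0.80]     # transfer|capture, model|transfer, clinical|model
print(naive_chain(marginals))
print(conditional_chain(0.90, conditionals))
```

With these toy values the naive product is about 0.43 while the conditional chain is about 0.59, so the independence assumption can shift an end-to-end figure substantially.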

From Experiments to Public Custody Paths

Before this infrastructure existed, experiments ended as local folders, private notes, or claims that required too much explanation. Now a weekly experiment can become a public record: a bounded run, score, report, artifact path, and failure note.

We do not see this as a replacement for peer review. We see it as the layer before peer review: the place where claims become inspectable, failures remain visible, and weekly experiments accumulate.

We also avoid claiming false determinism. Fold outputs vary with seed, GPU architecture, MSA depth, and compiler version. In EXP-031, we anchor the input FASTA, pin the seed and model versions, and record confidence metrics as a labelled reference run. The goal is regime-level comparison, not bit-exact reproduction.
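One way to anchor such a run is to hash the input and store the pinned settings beside the metrics. The record layout below is a hypothetical sketch, not the ledger's published schema; the field names, the placeholder sequence, and the version string are all invented for illustration.

```python
import hashlib
import json


def reference_run_record(fasta_text: str, seed: int,
                         model_versions: dict[str, str],
                         metrics: dict[str, float]) -> str:
    """Build a JSON record that anchors the input and pins run settings."""
    record = {
        "label": "reference_run",  # hypothetical field names throughout
        "input_sha256": hashlib.sha256(fasta_text.encode("utf-8")).hexdigest(),
        "seed": seed,
        "model_versions": model_versions,
        "metrics": metrics,
    }
    # sort_keys keeps the serialized record byte-stable for identical inputs.
    return json.dumps(record, sort_keys=True, indent=2)


fasta = ">example_target\nMKTAYIAKQR\n"  # placeholder, not the EXP-031 target
print(reference_run_record(fasta, seed=0,
                           model_versions={"model_a": "x.y.z"},
                           metrics={"pTM": 0.31}))
```

The hash pins what went in; the metrics are stored as a labelled reference rather than as values a rerun is expected to match bit for bit.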

The detailed repository tree is omitted here for space; the public ledger exposes the custody paths directly.

MICA and the CI Sanitizer Gate

Two layers keep the ledger honest before publication.

MICA (Memory Invocation & Context Archive for AI) is the governance memory. It encodes project invariants: internal metrics carry no external authority, fabricated data is prohibited, and public claims stay understated.

The sanitizer is the enforcement layer. It runs as a CI gate and fails the build when it finds workspace leaks, local-specific PII patterns, promotional language, or fabrication tags.

The synthetic_marker detector was added after Failure 0. The promotional_language detector flags public-facing hype such as "revolutionary," "state-of-the-art," or "authoritative."
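A gate of this shape can be sketched with two regular expressions and an exit code. The term list contains only the three words quoted above, and the function names and finding format are assumptions; the real detectors may work differently.

```python
import re

# Terms quoted in the text; the real detector's full list is not given here.
PROMO_TERMS = ("revolutionary", "state-of-the-art", "authoritative")
PROMO_RE = re.compile("|".join(re.escape(t) for t in PROMO_TERMS), re.IGNORECASE)
SYNTHETIC_RE = re.compile(r"\[synthetic\]", re.IGNORECASE)


def sanitize(text: str) -> list[str]:
    """Return one finding per match; an empty list means the text is clean."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if SYNTHETIC_RE.search(line):
            findings.append(f"{lineno}: synthetic_marker")
        for match in PROMO_RE.finditer(line):
            findings.append(
                f"{lineno}: promotional_language ({match.group(0).lower()})")
    return findings


def exit_code(findings: list[str]) -> int:
    """A non-zero exit code fails the CI job."""
    return 1 if findings else 0


doc = "Run 7 [synthetic] registry\nA revolutionary result\nOmega = 0.9985\n"
findings = sanitize(doc)
print(findings, exit_code(findings))
```

In a CI job the script would end with `sys.exit(exit_code(findings))`, so any finding blocks publication until it is removed or explicitly reviewed.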

The result is a repository that censors its own hype, leaks, and fabrications before publication — and keeps an auditable record of doing so.

Dashboard Heuristics: Labelled As Heuristics

The dashboard plots EQA metrics against an exponential curve.

Status: ADVISORY-HEURISTIC.

This curve is not a physical law. It is an empirical visual fitting heuristic for charting. Treating it as fundamental law would violate the ledger's credibility rules.

Omega(norm_R_F) = Omega_0 * exp(-lambda * norm_R_F) + epsilon_numerical
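For anyone re-plotting the dashboard, the curve is a one-line function. The parameter values below are hypothetical and chosen only to show the shape; the fitted values used on the dashboard are not given in this post.

```python
import math


def omega_fit(norm_r_f: float, omega_0: float, lam: float,
              eps_numerical: float = 0.0) -> float:
    """Advisory charting curve: Omega_0 * exp(-lambda * x) + epsilon_numerical."""
    return omega_0 * math.exp(-lam * norm_r_f) + eps_numerical


# Hypothetical parameters, for illustration only.
for x in (0.0, 1.0, 2.0):
    print(x, omega_fit(x, omega_0=1.0, lam=0.5))
```

The function is a plotting aid and nothing more, which matches its ADVISORY-HEURISTIC status.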

What We Need Others to Test

The ledger is not finished. It is a public audit brief.

We need formulation scientists to test whether EXP-005 reflects real carrier incompatibility or an over-conservative false reject.

We need structural biologists to test whether EXP-031 reflects genuine out-of-distribution behavior or a harness artifact.

We need string theorists to test whether TOE-TEST-0001 is physically correct in the Ricci-flat regimes where we claim results.

We need compliance experts to test whether the zero-execution Article 12 mapping in BSC is legally defensible or technically approximate.

All code, schemas, and run payloads: Zenodo DOI 10.5281/zenodo.20483364

Public Ledger: https://flamehaven01.github.io/Flamehaven-Verification-Ledger/
Citation ORCID: 0009-0009-2641-4280

Epilogue: The Invitation

We began by trying to reach the unreachable: spacetime geometries, novel small-molecule bindings, a Theory of Everything.

We did not get there. This ledger is not that.

What we found instead was smaller and more useful: a way to make a claim show the path by which it came to exist.

As AI fills every field with fluent, confident, publication-shaped output, the quiet casualty is the experiment itself: the runnable artifact, the failed run nobody deleted, the number traced back to code.

We are not announcing a solution. We are publishing a small, inspectable surface — failures included — and handing it to the people best equipped to break it.

If it proves useful against the confident noise filling everyone's feeds, that will be your verdict, not our claim.

We built this to learn from you.

2 Comments

kennytm · Answer 1 · 2026-06-04T03:21:02+0000

kennytm • Jun 3

Interesting read. It's funny how the tools we build to catch mistakes end up exposing our own assumptions too. Did any of the blind spots genuinely surprise your team?

Flamehaven • Jun 4

Yes — and the surprising part was that the blind spot was ours, not the system's.

We submitted a Planck-scale Schwarzschild black hole (mass M=0.1 in natural units), described it as "extreme curvature," and expected the engine to reject it. It returned PASS. We called it a divergence in the report.

Then we looked at why the engine was right. The Schwarzschild metric is a vacuum solution to the Einstein equations by definition — the Ricci tensor is exactly zero for any black hole mass, always. Our verification engine runs one-loop worldsheet beta-functions, which couple to the Ricci tensor.

Not to the Kretschmann scalar (K = 48M^2/r^6), which is already very large at the horizon for a mass this small, diverges as r approaches 0, and captures the tidal, physical danger of the geometry.

So the engine correctly PASSed a geometry that is, from a string-theory consistency standpoint, a valid vacuum. We had assumed "physically dramatic" and "beta-function violation" were the same thing. They aren't.

The system didn't fail to catch something. It revealed that we hadn't fully understood what our own engine could see.
