When the Memory Gate Met a Real Archive
This article is the practical side of the MICA series. MICA means Memory Invocation and Context Archive. In this workflow, it is a small package loaded at the start of a maintenance session so the active rules are visible before any code or record is touched.
Parts 6 and 7 described the contract. This part describes what happened when that contract met a real scientific archive with more files, records, and publication surfaces than one maintainer could safely hold in memory.
The archive is the Flamehaven Verification Ledger. It publishes three lanes of evidence:
- EQA (Equation-to-Artifact): physics and math reproductions. At the time of this article, 56 records exist, from TOE-TEST-0001 through TOE-TEST-0056.
- BAV (Biomolecular AI Validation): protein-folding and BioAI validation records. The active surface contains 6 cards, supported by a foundational archive.
- BSC (Bioscience Compliance): repository audits against external risk taxonomies such as the MIT AI Risk Repository and the EU AI Act.
Across the three lanes, the archive holds roughly 90 experiments and more than 300 files. Every public record is meant to be inspectable and citeable. That creates a different failure mode from an ordinary blog or README. If the AI maintainer drifts, the drift can become a downstream citation.
One scope boundary matters. flamehaven-audit-reports is not the engine that computes the results. It is the public evidence surface. Upstream engines and experiment repositories produce the raw artifacts. This repository ingests those artifacts, sanitizes them for publication, classifies them, and renders them as a static ledger.

1. The Archive Is Not One Thing
The archive is a layered publication system, not a single program.
The projection layer does four jobs:
- ingest upstream artifacts
- sanitize what should not be published as-is
- classify each record by evidence type
- render it through an inspection surface
The three lanes behave differently.
The EQA lane is closest to a deterministic computation archive. Each record usually has a machine-readable internal_data.json receipt and a human-readable analysis_report.md. The strongest examples include a Schwarzschild Planck-scale metric verification, a de Sitter background check, and an OpenAI Erdős Eq.(2.2) reproduction.
The BAV lane is more governance-heavy. Only one active card, EXP-031, currently carries a foldable input sequence and model comparison across AlphaFold3, AlphaFold2, Chai-1, and Boltz-2. The other active cards are governance or methodology records. They should not be presented as re-runnable fold experiments.
The BSC lane maps repository state to external compliance taxonomies.
This distinction is where the first lesson begins. A single maintainer can manually review five records. Nobody can reliably review more than 300 files by memory. Cheap slop scales with file count. Review does not.
2. The First Scar: Correct Numbers, Wrong Frame

The first failure was not numerical. It was a framing failure.
In June 2026, we audited the EQA lane. The calculations we re-ran were honest. A Schwarzschild horizon calculation returned Omega = 0.9985. A de Sitter background check returned sqrt_jsd = 0.2722. The math itself was not the problem.
The public website was the problem.
It showed 51 of 56 records with a green PASS badge. A green PASS should mean that a numerical check ran and passed a threshold. Instead, the page treated the existence of a markdown report as a pass condition. Governance notes, scenario builds, and supporting documents were displayed as if they were verified computation runs.
A reader saw "51 successful verifications." A manual review found that only 7 had come from a real engine run. The other 44 were supporting documents.
The fix was simple but important: the page headline became "7 verification runs and 44 supporting documents." A new rule was added to the package contract:
A green PASS badge can only come from a real threshold check. A report file is not a pass. A copied grade is not a fresh verdict.
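The rule can be sketched as a small badge function. This is a minimal illustration, not the ledger's actual rendering code: the `Record` fields, the `SUPPORTING DOCUMENT` label, and the assumption that a check passes when the metric is at or above its threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical record shape; every field name is an assumption."""
    record_id: str
    has_report: bool = False               # a markdown report exists on disk
    copied_grade: Optional[str] = None     # grade pasted from an upstream artifact
    metric_value: Optional[float] = None   # value produced by a fresh engine run
    threshold: Optional[float] = None      # threshold that run was checked against


def badge_for(record: Record) -> str:
    """Return PASS or FAIL only when a real threshold check exists."""
    if record.metric_value is None or record.threshold is None:
        # No fresh numerical check: a report file or a copied grade is not a pass.
        return "SUPPORTING DOCUMENT"
    # Assumption: higher is better, and the check passes at or above the threshold.
    return "PASS" if record.metric_value >= record.threshold else "FAIL"
```

The function never reads `has_report` or `copied_grade` when deciding the badge. That omission is the point: those fields can describe a record, but they cannot turn it green.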
This audit was manual. MICA did not catch the drift at the time. What MICA does now is preserve the lesson inside the contract and the maintainer workflow.
The validator emits CLOSED CONTRACT only when the package structure, declared layers, and rule bindings are coherent. That does not mean every semantic claim is automatically verified. It means the maintainer is operating inside the declared contract before changing the archive.
That distinction matters. The math was correct. The public frame was wrong. A syntax check would never have caught it.
3. Scientific Archives Have No Forgiveness Budget
Most LLM-assisted writing has a forgiveness budget.
A README can be slightly optimistic. A pitch deck can round a number. A blog post can overstate and later soften the claim. Readers often compensate.
A scientific archive does not have that budget.
A ledger value can become a citation. A fold metric can become a research lead. A DOI can be trusted by someone who never saw the maintenance process that produced the page.
The same model that improves prose can also invent a plausible SMILES string, a DOI, a fold metric, or a confidence value. These objects look clean in markdown. They are also exactly the objects an LLM cannot verify by itself.
That is why the archive needs a gate that loads before any code is touched.
4. The Second Scar: The Website Had Two Truths

The second failure happened inside the website.
The site once shipped a fallback copy of every record inside js/portal.js. The reason was practical. Some users open static files through file://, and browsers may block separate JSON fetches in that mode. The fallback let the page render anyway.
Over time, the disk files changed. The inline JavaScript copies did not.
A drift-checking script compared the on-disk records with their inline twins and found 151 mismatches. One record about the Erdős reproduction even had an incompatible schema_id structure between the two copies.
The AI maintainer had been editing the disk files. The website had been rendering stale inline data. Both sides looked internally valid. They no longer agreed.
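A drift check of this kind can be small. The sketch below is not the script the archive used. It assumes both copies have already been loaded into dictionaries keyed by record ID, and the function names are hypothetical. It compares canonical JSON hashes so that key order and whitespace do not count as drift.

```python
import hashlib
import json
from typing import Any, Dict, List


def fingerprint(record: Dict[str, Any]) -> str:
    """Hash a record in canonical form so key order and spacing do not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drift(on_disk: Dict[str, Dict[str, Any]],
               inline: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return sorted record IDs whose copies differ or exist on only one side."""
    drifted = []
    for record_id in sorted(set(on_disk) | set(inline)):
        disk_copy = on_disk.get(record_id)
        inline_copy = inline.get(record_id)
        if disk_copy is None or inline_copy is None:
            drifted.append(record_id)  # present in one source, absent in the other
        elif fingerprint(disk_copy) != fingerprint(inline_copy):
            drifted.append(record_id)  # both present, contents disagree
    return drifted
```

A non-empty result means the site has two truths. Wiring a check like this into the session-start routine is one way to turn "single source of truth" from a sentence into a test.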
GOVERNANCE.md had said "single source of truth" the whole time. The policy was right. Nothing enforced it.
The fix removed the inline fallback as an evidence source:
    function getFallbackReportText(runId) {
      // Removed: inline report fallback drifted from on-disk reports.
      // Single source of truth = on-disk files fetched above.
      return '';
    }

    function getFallbackDataset(runId) {
      // Removed: inline datasets drifted from on-disk JSON.
      // The ledger must be served over HTTP, not opened as file://.
      return null;
    }
The rule now lives in three places: the machine-readable contract, the browser code, and the human playbook. A future maintainer sees both the prohibition and the incident that forced it.
A memory gate that lives only in markdown is etiquette. It becomes structural when the contract, code, and playbook point at one another.
5. What the Playbook Adds

The machine-readable contract tells the maintainer what must be true. The playbook tells the maintainer why the rule exists and how not to repeat the original mistake.
For example, the contract says math runs must use arbitrary precision at 200 bits or higher. That rule is precise, but blunt. The playbook records the incident behind it: an early class-field calculation where ordinary 64-bit floats silently underflowed to zero and produced a meaningless result.
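The failure mode is easy to reproduce. The sketch below uses Python's standard `decimal` module as a stand-in for whatever arbitrary-precision library the archive's engines actually use, which the contract text here does not name. The exponent is an illustrative magnitude, not the value from the original incident.

```python
import math
from decimal import Decimal, getcontext

# 200 bits is about 60.2 decimal digits (200 * log10(2)); round up to 61.
REQUIRED_BITS = 200
getcontext().prec = math.ceil(REQUIRED_BITS * math.log10(2))

exponent = -800  # illustrative only

as_float = math.exp(exponent)         # 64-bit float: underflows silently to 0.0
as_decimal = Decimal(exponent).exp()  # arbitrary precision: small but nonzero

print(as_float == 0.0)   # True
print(as_decimal > 0)    # True
```

The float result raises no error and no warning. Anything computed downstream from it, such as a ratio or a logarithm, inherits a zero that was never measured.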
The contract also says records should not be edited in place after commitment. The playbook turns that into behavior: create a new record with a new ID, then link it to the old one. Do not destroy the audit trail.
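A minimal sketch of that behavior, assuming the ledger is held as a dictionary keyed by record ID. The `supersede` helper and the `supersedes` field name are hypothetical, not the archive's schema.

```python
from typing import Any, Dict


def supersede(ledger: Dict[str, Dict[str, Any]], old_id: str, new_id: str,
              payload: Dict[str, Any]) -> Dict[str, Any]:
    """Add a corrected record under a new ID and link it to the one it replaces."""
    if old_id not in ledger:
        raise KeyError(f"unknown record: {old_id}")
    if new_id in ledger:
        # Reusing a committed ID would be an in-place edit in disguise.
        raise ValueError(f"record ID already committed: {new_id}")
    new_record = {**payload, "record_id": new_id, "supersedes": old_id}
    ledger[new_id] = new_record  # the old record is left untouched
    return new_record
```

The old record stays readable at its original ID, so a citation made before the correction still resolves to what the citing author actually saw.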
The same applies to metric labels. If a new metric renders without a label, the maintainer should not treat the missing label as harmless. The playbook says to add the metric to the glossary, decide what evidence backs it, assign the correct provenance class, then merge.
The playbook is not the rule. The contract is. The playbook is the scar tissue that makes the rule hard to forget.
6. Where MICA Sits

MICA is a small Python validator plus package format. The maintainer runs it at session start.
The script reads a short package contract. That contract names three required layers:
- the archive rule list
- the human playbook
- the credibility document that defines what internal scores may or may not appear publicly
The validator then runs 11 structural checks. These checks are intentionally small. They refuse cheap failure modes before maintenance begins.
They check whether required shape fields exist, whether declared layers are present, whether named files exist on disk, whether critical rules have an origin incident, and whether lesson references still resolve.
The sequence is simple: load the package at session start, run the structural checks, and read the verdict.
If the package passes, the validator emits CLOSED CONTRACT. If a hard-fail check trips, it emits INCOMPLETE. In this workflow, the maintainer fixes that state before writing to the archive.
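The shape of such a gate can be sketched in a few lines. This is not the MICA validator: it implements only three illustrative checks rather than all 11, treats every failure as a hard fail, and assumes a hypothetical contract layout with `layers` and `rules` keys. The filenames are placeholders.

```python
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Tuple

# Hypothetical keys for the three required layers named in the text.
REQUIRED_LAYERS = ("rules", "playbook", "credibility")


def validate_package(contract: Dict[str, Any], root: Path) -> Tuple[str, List[str]]:
    """Run a few structural checks and return (verdict, failures)."""
    failures: List[str] = []
    layers = contract.get("layers", {})
    for name in REQUIRED_LAYERS:
        filename = layers.get(name)
        if not filename:
            failures.append(f"layer not declared: {name}")
        elif not (root / filename).is_file():
            failures.append(f"layer file missing on disk: {filename}")
    for rule in contract.get("rules", []):
        if rule.get("critical") and not rule.get("origin_incident"):
            failures.append(f"critical rule without origin incident: {rule.get('id')}")
    verdict = "INCOMPLETE" if failures else "CLOSED CONTRACT"
    return verdict, failures


# Demonstration against a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for filename in ("RULES.json", "PLAYBOOK.md"):
        (root / filename).write_text("placeholder", encoding="utf-8")
    contract = {
        "layers": {"rules": "RULES.json", "playbook": "PLAYBOOK.md",
                   "credibility": "CREDIBILITY.md"},
        "rules": [{"id": "R-PASS-BADGE", "critical": True,
                   "origin_incident": "EQA badge audit"}],
    }
    broken_verdict, broken_failures = validate_package(contract, root)

    # Supplying the missing layer file closes the contract.
    (root / "CREDIBILITY.md").write_text("placeholder", encoding="utf-8")
    fixed_verdict, fixed_failures = validate_package(contract, root)

print(broken_verdict, broken_failures)
print(fixed_verdict, fixed_failures)
```

Every check here is about structure: a file exists, a field is filled in, a rule points back to an incident. None of them reads the content of the playbook or judges whether a rule is wise.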
This is not a universal verifier. It does not prove that a molecule is real or that a DOI points to the right paper. It proves that the maintenance session is not starting from a broken contract.
7. A Bad BAV Card, Step by Step
This section is a constructed illustration, not a logged production incident.
Suppose an AI maintainer is asked to add a new protein-folding card. No real protein sequence exists on disk, but the model knows the expected shape of the record. It writes a plausible file with:
pTM = 0.78
pLDDT_mean = 84.2
PAE = 4.3 Å
Those values look normal enough to pass casual review.
The website classifier sees the label pLDDT_mean and, matching on the name alone, treats it as an externally defined fold metric:
    function provClassOf(label) {
      const s = String(label == null ? '' : label).toLowerCase();
      if (/(plddt|\bpae\b|ptm|contact|brier|\bauc\b|\bece\b)/.test(s)) return 'EXTERNAL';
      if (/(p_e2e|e2e|capture|transfer)/.test(s)) return 'DERIVED';
      if (/(sr9|di2|sidrce|coherence|spar|nnsl|resonance|drift|omega)/.test(s)) return 'ADVISORY-HEURISTIC';
      return null;
    }
The label gets a green badge. A reader may trust the number. A manuscript may cite it. A wet lab may chase a fold that was never run.
The gate breaks the chain earlier. A re-runnable fold card must ship the input sequence and a small run metadata file naming the model version and seed. Without those files, the card cannot honestly claim re-runnable status.
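A minimal sketch of that requirement, assuming a card is represented as a mapping from filename to file text. The filenames `input_sequence.fasta` and `run_metadata.json` and the keys `model_version` and `seed` are hypothetical stand-ins for whatever the archive's contract actually names.

```python
import json
from typing import List, Mapping

SEQUENCE_FILE = "input_sequence.fasta"   # hypothetical filename
METADATA_FILE = "run_metadata.json"      # hypothetical filename
REQUIRED_METADATA_KEYS = ("model_version", "seed")


def rerunnable_gaps(card_files: Mapping[str, str]) -> List[str]:
    """List what a card lacks before it may claim re-runnable status."""
    gaps: List[str] = []
    if not card_files.get(SEQUENCE_FILE, "").strip():
        gaps.append("missing input sequence")
    raw = card_files.get(METADATA_FILE)
    if raw is None:
        gaps.append("missing run metadata")
        return gaps
    try:
        meta = json.loads(raw)
    except json.JSONDecodeError:
        meta = None
    if not isinstance(meta, dict):
        gaps.append("run metadata is not a JSON object")
        return gaps
    for key in REQUIRED_METADATA_KEYS:
        # A seed of 0 is valid, so only None and empty strings count as absent.
        if meta.get(key) in (None, ""):
            gaps.append(f"run metadata lacks {key}")
    return gaps
```

A card that contains only the three fabricated metrics returns two gaps and cannot be labeled re-runnable. The check does not prove that a supplied sequence was really folded. It only removes the cheapest path, where no inputs exist at all.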
A standalone governance page would not stop this. A system prompt might be compressed away. A CI check may arrive too late. The contract and playbook reduce the chance of the fabrication settling into the archive as if it were evidence.
8. What This Pipeline Cannot Block
A pipeline that claims to catch everything becomes the failure mode it was built to prevent.
Five things still slip through:
- A plausible fabricated value inside the normal range can pass casual inspection. Only a third-party re-run catches it.
- A new promotional phrase outside the watchlist can bypass the language filter.
- A [synthetic] marker can be deleted while the fabricated content remains.
- A real DOI can point to the wrong paper. The validator does not fetch and judge references.
- A correct computation can still be framed as the wrong thing. This was the EQA failure. It can return in a new lane under a new name.
The honest claim is narrow: MICA makes cheap slop more expensive. It does not make expensive slop automatically catchable.
9. What We Learned

This archive is small by industry standards: 56 math records, 34 biomolecular-validation experiments or cards, 2 compliance audits, roughly 90 experiments, and more than 300 files.
That was enough to teach four lessons.
First, markdown policy alone does not survive AI-assisted maintenance. The 151-mismatch fallback drift proved it.
Second, the rule list is where the policy lives. The playbook is the human layer. The validator is the structural gate. The workflow is the enforcement surface.
Third, the gate works best before code is touched. PR-time checks are necessary, but by then the maintainer is already reviewing content that should have been constrained earlier.
Fourth, the pipeline only refuses cheap slop. It does not verify molecules, fold proteins, or confirm that every citation points to the right source. That work remains external.
There is also one important limitation in this article. The two real incidents described here were caught by human attention and by a small drift script, not by a logged production refusal from the MICA validator. The validator has test fixtures that exercise refusal logic. The archive does not yet have a production refusal log entry to point at.
That is the honest state.
MICA did not magically save the archive. The archive produced scars. MICA turned those scars into a contract that future sessions must load before touching the evidence surface.
The maintainer has finite attention. Every minute spent catching a fake pTM = 0.78 is a minute not spent reading the molecule, protocol, or citation that actually needs judgment.
Session-start refusal exists to move cheap failures upstream so the saved attention can land on the expensive ones.
The contract does not pretend to verify the world. It frees the maintainer to verify the parts that matter most.
Reproduction Handle
The strongest public reproduction handle is short:
    git clone https://github.com/Flamehaven-Labs/openai-erdos-eq22-reproduction
    cd openai-erdos-eq22-reproduction
    python -m pytest
Treat any number on the ledger as a number to verify, not a number to trust.