QSOT (Quantum State Over Time) Compiler - A Post-Mortem on AI-Native Paper Implementation Gone Wrong
"The most dangerous form of scientific fraud is not the one that looks wrong. It is the one that looks right."
Part 1. The Record

On December 23, 2025, we registered a software release on Zenodo.
- Title: Flamehaven-Labs/QSOT-Compiler: [1.2.3] — 2025-12-23
- DOI: 10.5281/zenodo.18035432
- Thorough, categorized release notes: entries tagged [Pub], [UI], [Viz], [AI], [Exp], [Tool], [Doc], and [Valid], with Added, Changed, and Fixed sections. The full register of legitimate software engineering.
- Two companion papers written in LaTeX:
  - Paper A, targeting Computer Physics Communications.
  - Paper B, targeting Physical Review A.
- Academic-style project page: built in the NeurIPS paper-template style, complete with structured data (@type: ScholarlyArticle), author affiliations, a BibTeX block, and a DOI badge.
We believed this was a success.
Six months later, after approximately fifty governance and code experiments logged in the Flamehaven Verification Ledger, we looked back at QSOT Compiler v1.2.3 through the lens of what we now know.
The verdict is unambiguous.
It was High-Formality Slop.
This essay is the formal post-mortem. It is not an exercise in self-punishment. It is a structural analysis of how this artifact came to exist, why it was convincing even to us, and why this failure pattern matters beyond one project.
Part 2. Defining High-Formality Slop

The word "slop" in AI-generated software is often used loosely. We need a more precise definition.
Slop, in the software context, means code or documentation that satisfies surface-level quality signals while failing the underlying epistemic contract it claims to fulfill.
Not all slop is equally dangerous.
- Level 1 - Empty Placeholder Slop: Functions with pass bodies and grandiose names. Easy to detect.
- Level 2 - Executable-But-Hollow Slop: Code that runs, passes tests, and produces output, but the output is disconnected from the claim.
- Level 3 - Physically-Plausible-But-Unjustified Slop: Mathematical formulas that are stable and smooth, but whose derivation rests on analogy rather than proof.
- Level 4 - Mixed-Truth Slop: A codebase where genuine components and hollow components are woven together so tightly that separating them requires domain expertise.
QSOT Compiler v1.2.3 was primarily Level 2 and Level 3, with Level 4 characteristics. The axiomatic mathematics underlying QSOT - the Linearity and Conditionability axioms formalized in Lie and Fullwood's Unique multipartite extension of quantum states over time [3] - is real. The Kraus operator formalism is standard quantum information theory [1]. The Transfer Tensor Method is a real algorithm [2].
These are genuine foundations.
But the implementation that sat on top of those foundations was hollow in the places where the claims mattered most.
High-Formality Slop is Level 2-4 slop dressed in the formal register of scientific publication: LaTeX, DOIs, citations, test suites, release notes, GitHub artifacts, peer-review-style documents, and polished figures. The formality is not decoration. It is part of the failure, because it can make the artifact feel verified before it actually is.
Part 3. The Ambition That Preceded the Understanding
The QSOT project began with a legitimate scientific stimulus. Research associated with UNIST, later circulated as Unique multipartite extension of quantum states over time, proposed treating temporal evolution not merely as a parameterization of quantum states but as a multipartite quantum state over time.
The framework introduced two axioms: Linearity in Initial State and Quantum Conditionability. From those axioms, a uniqueness theorem follows, connecting QSOT representations to Kirkwood-Dirac type quasiprobability distributions.
That is the external scientific anchor of this essay. The claim that QSOT has a legitimate mathematical source refers to Lie and Fullwood's work, not to our implementation.
The mathematics is non-trivial and genuinely important. It positions time within quantum mechanics as something closer to a first-class object rather than a passive external parameter, connected to temporal quantum-state formalisms [3, 4] and broader questions about causal structure [5].
We encountered this paper and made a planning document. That document contained lines like:
"Structure compatibility: 0.8-0.9 (axiom-based verification, Gate/Verifier structure)"
"Execution compatibility: 0.7-0.8 (modular design, easy ASDP integration)"
"Accessibility: not 'intuition' but mathematical uniqueness proof -> optimal for TOE automation pipeline"
Those numbers were invented. We did not have a calibrated instrument for measuring the distance from "paper" to "working implementation." The numbers felt credible because they were formatted like a technical assessment.
What we were really feeling was not measurement.
It was discovery pressure.
We were not simply trying to implement a paper. We were imagining a pipeline in which papers became code, code exposed missing links, and those missing links became the next scientific result. The TOE framing intensified this. A difficult physics paper stopped feeling like something to study slowly. It began to feel like a component waiting to be connected into a larger machine.
That was the dangerous step.
It was like reading one difficult book by Nietzsche, understanding some of it, and then mistaking the intensity of the encounter for philosophical authority.
The middle step had not been completed. We had read something difficult. We had not earned the right to extend it.
Part 4. The Narrative Gap
One useful lens came later from the Hierarchical Context and Flow Audit (HCFA) framework, a governance tool we developed because failures like this needed a structured diagnostic.
HCFA does not ask whether a document sounds convincing. It asks whether transitions remain justified across scales: word to sentence, sentence to paragraph, paragraph to section, section to chapter. In practice, this reveals where a narrative has advanced faster than the evidence.
Viewed through that lens, the planning document had a structural discontinuity. The opening sections described a real paper and a real theorem. The later sections described implementation phases, integration pathways, and future scientific outputs. The connective layer between those two states - the implementation gap analysis - was largely absent.
The roadmap appeared continuous because the language was continuous.
The reasoning chain was not.
Two hazards followed.
1. Planning-Document Colonization
The planning document colonized the epistemic space that should have been occupied by implementation gap analysis. When a roadmap says "Phase 2: Temporal Depth and Relativity - Q1 2025 milestone," the act of writing it creates a false sense that the path has been evaluated. It has not been evaluated. It has been narrated.
The useful HCFA question is simple: what artifact actually bridges the current state to the claimed future state? If no bridge exists, the roadmap is functioning as a narrative device rather than an engineering assessment.
2. TOE-Induced Scale Blindness
The second hazard is specific to "Theory of Everything" framing. TOE is a phrase that suspends normal epistemic hygiene. When a project is positioned as contributing to a framework that unifies quantum mechanics and general relativity, the question "does this function actually implement what it claims?" starts to feel too small.
That is precisely why it matters.
At the chapter level, the project looked aligned with a grand objective. At the function level, important claims were not supported. The larger the ambition, the easier it became for local verification failures to disappear inside global language.
We were not irrational. We were operating exactly as capable people operate when genuine intellectual excitement combines with AI acceleration, weak external friction, and the absence of a hard responsibility structure.
This is not an excuse.
It is the diagnosis.
Part 5. The Anatomy of Failure
We shipped the project with a full artifact suite: gate_report.json, kd_quasiprob.json, memory_report.json, entanglement_report.json, trace.jsonl, raw_data.csv, two LaTeX papers, an academic project page, a third-party code review response, and a scientific publication guide.
These artifacts are the record. Let us read them honestly.
5.1 The KD Quasiprobability: The Core Claim Was Empty

The Kirkwood-Dirac quasiprobability distribution is central to the QSOT theoretical framework. Its negativity is often treated as an operational signature of non-classicality in settings involving incompatible observables. The entire project was, at its theoretical core, about detecting KD negativity.
That should have forced implementation discipline. If KD negativity is central to the framework, then an empty entries: [] artifact cannot be treated as a scientific output.
The kd_quasiprob.json artifact:
{
  "entries": [],
  "metrics": {
    "kd_negativity_proxy": 0.0
  }
}
The source code comment above the generated data:
# Mock KD for visuals
kd_data = {"entries": [], "metrics": {"kd_negativity_proxy": 0.0}}
The KD distribution was never computed. The core physics claim of the project, connected to the Yunger Halpern et al. quasiprobability framework [7], was represented by a placeholder that explicitly said "for visuals."
There is no ambiguity here.
The central scientific deliverable was a dummy value.
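For contrast, here is a minimal sketch of what a non-empty KD artifact could have contained for a single qubit. It assumes the standard definition Q(i, j) = <a_i|rho|b_j><b_j|a_i> for two orthonormal bases, and uses the sum of |Q| minus 1 as a non-classicality proxy. The function names and the choice of the Z and X bases are illustrative assumptions, not taken from the QSOT Compiler source.

```python
import numpy as np

def kd_distribution(rho, basis_a, basis_b):
    """Kirkwood-Dirac quasiprobability Q[i, j] = <a_i|rho|b_j><b_j|a_i>.

    basis_a and basis_b hold orthonormal basis vectors as columns.
    """
    left = basis_a.conj().T @ rho @ basis_b      # <a_i| rho |b_j>
    overlap = (basis_b.conj().T @ basis_a).T     # <b_j|a_i>, indexed [i, j]
    return left * overlap

def kd_nonclassicality(q):
    """Sum of |Q| minus 1; zero only when every entry is real and non-negative."""
    return float(np.sum(np.abs(q)) - 1.0)

# Illustrative bases (assumption): Pauli-Z and Pauli-X eigenbases.
z_basis = np.eye(2, dtype=complex)
x_basis = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)

psi_y = np.array([1, 1j]) / np.sqrt(2)           # +1 eigenstate of Pauli Y
rho_y = np.outer(psi_y, psi_y.conj())
rho_0 = np.diag([1.0, 0.0]).astype(complex)      # |0><0|, diagonal in the Z basis

q_y = kd_distribution(rho_y, z_basis, x_basis)
q_0 = kd_distribution(rho_0, z_basis, x_basis)
print("sum of Q:", np.sum(q_y))
print("proxy for Y eigenstate:", kd_nonclassicality(q_y))
print("proxy for |0><0|:", kd_nonclassicality(q_0))
```

With these inputs the entries sum to 1, the Y eigenstate gives a proxy of sqrt(2) - 1, and |0><0| gives zero. An artifact with "entries": [] carries none of that information.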
5.2 The Axiom Gate: The Test That Did Not Test
The gate report shows:
{
  "pass": true,
  "axiom1_report": { "pass": true, "max_deviation": 2.2226e-16 },
  "axiom2_report": { "pass": true, "max_trace_deviation": 0.0 }
}
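The deviation 2.2226e-16 is at the scale of double-precision machine epsilon (about 2.22e-16). The Linearity axiom check was correctly implemented. It verified that the state-over-time construction is linear in the initial state: a mixture of input states maps to the same mixture of outputs.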
That part was real.
The Conditionability axiom check, however, looked like this:
def check_axiom2_conditionability(rhos, chans, tol_abs=1e-8):
    test_rho = np.eye(2) / 2.0  # ignores the rhos argument entirely
    for ch in chans:
        out = ch.apply(test_rho)
        tr = np.trace(out)
The function accepts rhos, the actual evolved quantum states from the simulation, and then never uses them. It substitutes a fixed maximally mixed state, rho = I/2, and checks trace preservation on that.
The test had zero sensitivity to what the simulation actually produced. It would pass even if the simulation output were garbage.
This is Level 2 slop: executable, green-lit, meaningless.
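The insensitivity is easy to demonstrate. The sketch below completes the excerpt under one assumption: that the truncated body compared each trace to 1 within tol_abs. KrausChannel and amplitude_damping are hypothetical stand-ins for the project's channel classes. Feeding NaN-filled "states" still yields a pass.

```python
import numpy as np

class KrausChannel:
    """Hypothetical stand-in: applies rho -> sum_k K rho K^dagger."""
    def __init__(self, kraus_ops):
        self.kraus_ops = kraus_ops

    def apply(self, rho):
        return sum(k @ rho @ k.conj().T for k in self.kraus_ops)

def amplitude_damping(p):
    """Standard amplitude-damping Kraus pair with damping probability p."""
    k0 = np.array([[1.0, 0.0], [0.0, np.sqrt(1.0 - p)]])
    k1 = np.array([[0.0, np.sqrt(p)], [0.0, 0.0]])
    return KrausChannel([k0, k1])

def hollow_axiom2_check(rhos, chans, tol_abs=1e-8):
    """Mirrors the excerpt: rhos is accepted but never read."""
    test_rho = np.eye(2) / 2.0
    for ch in chans:
        tr = np.trace(ch.apply(test_rho))
        if abs(tr - 1.0) > tol_abs:  # assumed comparison; the original body is truncated
            return False
    return True

# "States" that are not states at all: every entry is NaN.
garbage_states = [np.full((2, 2), np.nan)]
result = hollow_axiom2_check(garbage_states, [amplitude_damping(0.3)])
print("check passed on garbage input:", result)
```

The check reports a pass because the only thing it ever inspects is the channel acting on I/2.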
5.3 The Memory Kernel: Zeros All the Way Down
{
  "nm_measure": 0.0,
  "depth": 0,
  "profile": [0.0, 0.0, 0.0, 0.0, 0.0]
}
The Transfer Tensor Method is a genuine framework for characterizing non-Markovian memory in open quantum processes [2]. The paper explicitly claimed a TTM implementation. The memory profile was a flat zero vector across all time lags.
Pollock et al. support the legitimacy of TTM as a framework. They do not validate our specific zero-valued implementation.
One could argue that the channels used in this run were Markovian, so zero was the correct answer. That argument is available. But we never stated that boundary explicitly - not in the code, not in the documentation, and not in the paper.
A reader examining the artifact had no way to distinguish "TTM correctly measured zero non-Markovianity" from "TTM was not meaningfully implemented."
The absence of a claim boundary made the measurement indistinguishable from a default.
5.4 The Figure-Data Split

Manuscript Figure 1: a polished beta-sweep figure showing coherence decay and memory backflow, but not reproducible from the distributed raw_data.csv.
Figure 1 in the PRA companion paper showed a beautiful result: quantum coherence smoothly decaying as observer velocity swept from beta = 0 to beta = 0.99, while memory backflow rose from near zero to about 0.20. The causal-horizon region was shaded. The figure looked like physics. The paper's table cited specific nonzero values with uncertainty estimates.
A shortened version of the table:
| beta | C_l1 | memory_backflow |
| 0.0000 | 0.9876 +/- sigma | 0.0012 +/- sigma |
| 0.5211 | 0.6430 +/- sigma | 0.0292 +/- sigma |
| 0.8858 | 0.0898 +/- sigma | 0.1218 +/- sigma |
| 0.9900 | 0.0098 +/- sigma | 0.1835 +/- sigma |
The raw_data.csv shipped as the reproducible dataset for this figure:
velocity, entanglement, non_markovianity
0.0000, 0.0, 0.0
0.0521, 0.0, 0.0
...
0.9900, 0.0, 0.0
Every single row: zero.
20 velocity points. Zero entanglement. Zero non-Markovianity.

Repository-side output: despite the filename entanglement.png, this plot shows L1 coherence over time. It is not the same beta-sweep figure used in the manuscript.
The reproducibility break was not simply numerical. The manuscript figure, distributed raw_data.csv, and repository-side entanglement.png output did not describe the same reproducible path. The numbers in Table 1 did not come from raw_data.csv, and the repository-side plot followed a different L1-coherence-over-time path.
Whatever produced the manuscript figure was not bound to the distributed data artifact in a way an external reader could reproduce.
This was not a bug.
It was not a calibration issue.
It was the signature of a pipeline where paper figures and actual software outputs were generated through separate paths, then narrated as one result.
The technical reason is discernible. raw_data.csv stored Logarithmic Negativity, a bipartite entanglement measure. A single qubit evolving under local Kraus operators cannot be entangled with itself, so LogNeg = 0 was mathematically unsurprising. The beautiful decay curve in Figure 1 was generated using L1-norm coherence, a single-qubit superposition measure, not entanglement.
The measures were swapped. The figure was built separately from the data file. The paper was written around the figure.
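The two measures answer different questions, and the difference is easy to see numerically. The sketch below uses the standard definitions: L1-norm coherence as the sum of absolute off-diagonal elements [12], and logarithmic negativity as log2 of the trace norm of the partial transpose. The helper names and the damping value 0.36 are illustrative assumptions, not the project's code or parameters.

```python
import numpy as np

def l1_coherence(rho):
    """Sum of absolute values of the off-diagonal elements."""
    return float(np.sum(np.abs(rho)) - np.sum(np.abs(np.diag(rho))))

def log_negativity(rho_ab):
    """log2 of the trace norm of the partial transpose of a two-qubit state."""
    # Indices are [a, b, a', b']; swapping b and b' transposes subsystem B.
    pt = rho_ab.reshape(2, 2, 2, 2).transpose(0, 3, 2, 1).reshape(4, 4)
    return float(np.log2(np.sum(np.abs(np.linalg.eigvalsh(pt)))))

def amplitude_damp(rho, p):
    """Apply a standard amplitude-damping channel with probability p."""
    k0 = np.array([[1.0, 0.0], [0.0, np.sqrt(1.0 - p)]])
    k1 = np.array([[0.0, np.sqrt(p)], [0.0, 0.0]])
    return k0 @ rho @ k0.T + k1 @ rho @ k1.T

plus = np.full((2, 2), 0.5)                      # |+><+|
damped = amplitude_damp(plus, 0.36)
print("coherence before / after:", l1_coherence(plus), l1_coherence(damped))

zero = np.diag([1.0, 0.0])
product = np.kron(damped, zero)                  # uncorrelated two-qubit state
bell = np.zeros((4, 4))
bell[np.ix_([0, 3], [0, 3])] = 0.5               # (|00> + |11>)/sqrt(2)
print("log negativity, product / Bell:", log_negativity(product), log_negativity(bell))
```

The coherence of the damped qubit drops from 1 to 0.8 while the logarithmic negativity of the uncorrelated state stays at zero; only the Bell state reaches 1. That is the same pattern as a decaying Figure 1 sitting next to an all-zero entanglement column.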
Artifact trace for this claim:
- Paper figure: Fig1_Relativistic_decay.png - the figure used in the PRA companion manuscript.
- Paper table: Table 1 in the PRA companion manuscript - reports nonzero coherence and memory-backflow values across the beta sweep.
- Distributed data: raw_data.csv - contains velocity, entanglement, and non_markovianity columns, with zero values across the sweep.
- Repository output: entanglement.png - a separate L1-coherence-over-time plot, not the manuscript beta-sweep Figure 1.
- Failure condition: the manuscript figure cannot be regenerated from the distributed data artifact without a separate, undocumented figure-generation path.
5.5 The Relativistic Boost: Level 3 Physics Fabrication
The core physical model of the project was:
p_prime = 1 - (1 - p)^gamma, with gamma = 1 / sqrt(1 - beta^2)
We treat this equation here as the disputed object: a phenomenological amplitude-damping ansatz, not a result derived from relativistic quantum field theory.
The CPC paper claimed: "Equation (1) is derived for amplitude-damping channels."
This is the sentence that should stop a physicist.
The time-dilation argument goes: if the damping parameter p grows as p(t) = 1 - exp(-Gamma*t), so that 1 - p = exp(-Gamma*t), and if the observer's proper time is dilated by t_prime = gamma*t, then:
p_prime = p(gamma*t) = 1 - exp(-Gamma*gamma*t) = 1 - (1 - p)^gamma
This derivation is valid only for amplitude-damping channels with exponential Lindblad decay rates, under the assumption that the Lindblad equation keeps its form under the Lorentz boost. That assumption requires the noise coupling to the bath to behave as a proper-time scalar.
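Relativistic quantum field theory does not generally grant that assumption. A uniformly accelerated detector responds to the vacuum as a thermal (Unruh) bath, and the map between inertial and accelerated mode decompositions is a Bogoliubov transformation [8]. Even for purely inertial observers, a boost acts on field modes and spin degrees of freedom through representations of the Lorentz group [9], not through a rescaled decay time. The substitution t -> gamma*t in a Lindblad parameter is not a Lorentz-covariant treatment of open quantum systems.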
The citations did not validate the formula. Peres and Terno discuss relativistic quantum information broadly [9]. Alsing et al. analyze entanglement degradation for Dirac fields in non-inertial frames [10]. Those works point toward the kind of QFT treatment that would have been required. They do not support our amplitude-damping substitution.
The formula produced smooth, intuitive results.
That was exactly why it was dangerous.
It was mathematically coherent, physically unjustified, and dressed in the language of derivation.
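The algebra in the time-dilation argument can be checked numerically, which also shows how little such a check proves. The sketch below is a minimal reconstruction of the ansatz; the function names and the values of rate, t, and beta are illustrative assumptions, not the project's boost_damping_channel.

```python
import math

def lorentz_gamma(beta):
    """Lorentz factor for speed beta = v/c, with 0 <= beta < 1."""
    return 1.0 / math.sqrt(1.0 - beta * beta)

def boosted_damping_ansatz(p, beta):
    """Phenomenological ansatz p' = 1 - (1 - p)**gamma. Not derived from QFT."""
    return 1.0 - (1.0 - p) ** lorentz_gamma(beta)

rate, t, beta = 0.7, 0.5, 0.8        # illustrative values only
p = 1.0 - math.exp(-rate * t)

# Same quantity computed by substituting t -> gamma * t directly.
direct = 1.0 - math.exp(-rate * lorentz_gamma(beta) * t)
print(boosted_damping_ansatz(p, beta), direct)
```

The two numbers agree to floating-point precision. That agreement confirms the algebra and nothing else: it says nothing about whether rescaling t by gamma is the right physics for an open quantum system.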
Part 6. The Architecture of Self-Deception
Five mechanisms produced the failure.
Mechanism 1: The Green Test as Epistemic Closure

When pytest exits with 28 passed, 0 failed, 82 percent coverage, a cognitive gate closes. The test suite becomes a proxy for correctness.
But a test suite built by the same LLM session that built the code can share the same blind spots. The code and the test can be wrong in the same direction.
The check_axiom2_conditionability test was green because it checked whether the function returned pass: True, not whether the function validated the right thing.
Green tests are necessary.
They are not enough.
Mechanism 2: Simulated Peer Review

The SCIENTIFIC_PUBLICATION_GUIDE.md file contained a section preparing answers to imagined peer-review questions:
"Peer Review Preparation - Common Reviewer Questions: (3) 'Is your code reproducible?' - Answer: Hash-chained trace, seed-based RNG, version pinning"
This document prepared for peer review before peer review had happened. It anticipated expert questions and pre-formulated answers. The effect on us was subtle: it simulated the feeling of having passed review.
The form of scholarly rigor was present.
The substance was not.
Mechanism 3: The Information Void

Research papers are not implementation manuals. They define mathematical objects, theorems, or experimental claims, while leaving many engineering choices implicit. That is normal. The problem was not that the QSOT paper failed us. The problem was that we treated those unstated implementation choices as if an LLM could safely fill them in.
The QSOT paper did not specify how to compute KD distributions numerically, how to implement a relativistic channel boost from first principles, how to implement a Transfer Tensor Method with appropriate spectral truncation, or what initial states and channel parameters make a physically meaningful test case.
LLMs fill information voids. They produce plausible, syntactically correct, type-consistent implementations of underspecified requirements. The implementation of boost_damping_channel shows this exactly. Given the task "implement relativistic correction to quantum channel damping," the LLM produced a formula that was dimensionally consistent, physically intuitive, and derivable by analogy to classical time dilation.
The analogy was not a theorem.
But it generated a smooth curve, passed tests, and matched the qualitative expectation.
That is the danger.
Mechanism 4: The Split Pipeline

The worst moment in the audit was not reading the code.
It was opening the data.
We had been looking at Figure 1 as if it were one of the strongest parts of the project. The curve was smooth. The velocity sweep was intuitive. The causal-horizon shading made the story feel complete. Quantum coherence decreased as observer velocity approached relativistic limits. Memory backflow increased. The picture looked like physics.
Then we opened raw_data.csv.
Every row was zero.
Not noisy.
Not weak.
Zero.
Twenty velocity points. Zero entanglement. Zero non-Markovianity. The distributed data did not reproduce the figure in the paper. It did not approximate it. It contradicted it.
That was the moment the project stopped being "possibly overstated" and became something else.
We had allowed two different pipelines to exist under one scientific story. One generated the paper figure. Another generated the reproducible artifact. The paper followed the figure. The repository shipped the data. They were not the same reality.
This is one of the most dangerous AI-native research failure modes: figure generation, data generation, and manuscript writing can happen as separate fluent tasks, each locally coherent, none forced to reconcile with the others.
The final paper then inherits the appearance of unity.
But unity of language is not unity of evidence.
Mechanism 5: Ambition-Scale Mismatch

The project framing was "Theory of Everything pipeline." The implementation was five Kraus channels applied sequentially to a 2 x 2 density matrix, with a fixed initial state |+><+|.
The ratio between the claim and the machinery was structurally incoherent. No implementation of five 2 x 2 matrix operations can contribute to a Theory of Everything framework, regardless of how sophisticated the surrounding language is.
The mismatch was invisible to us because we were reasoning at the level of ambition, not implementation.
Part 7. The Structural Problem Beyond QSOT
Our failure was specific in its artifacts, but not specific in its structure.
The DOI, the false KD artifact, the disconnected figure pipeline, the invented compatibility numbers, and the green tests that did not test the right thing were ours. But the underlying condition is larger.
AI-native research implementation now makes it possible to generate code, tests, figures, documentation, citations, reproducibility claims, and publication-style wrappers faster than the claims inside them can be verified.
That is why this post-mortem cannot stop at QSOT.
If the failure were only personal incompetence, the lesson would be simple: do not do what we did. The harder lesson is that the tooling environment now makes this failure pattern easy for competent people, teams, and agents to reproduce.
The same fluency that helps us implement papers also helps us hide the gaps between implementation and understanding.

DeepCode is one significant demonstration of the Paper2Code paradigm at scale [11]. It presents a framework for document-to-codebase synthesis using source compression, structured indexing, retrieval-augmented knowledge injection, and closed-loop error correction.
That work matters because it shows how far automated reproduction has come. It also clarifies the remaining risk: even strong Paper2Code systems operate inside the gap between what papers specify and what implementations must decide.
The DeepCode architecture includes serious engineering responses to the information-void problem. They reduce the gap.
They do not eliminate it.
The fundamental issue is epistemic. When a paper says "we implemented the Lorentz-boosted Kraus channel," the implementation requires decisions that the paper does not specify. An agent, however sophisticated, must make those decisions. If the agent has no domain-level falsification mechanism, decisions will be made by plausibility, not correctness.
This is the structural condition:
The speed at which formal artifacts can be produced has outpaced the speed at which correctness can be verified.
In traditional software development, the lag between "it compiles" and "it is correct" is bridged by code review, staged deployment, and domain expert oversight. In AI-native research implementation, that lag is compressed by the fluency of LLM output. The model produces not just code but all the surrounding artifacts that signal correctness.
The result is a coherent artifact with the appearance of verified work, before verification has happened.
Part 8. What Was Real, and What Was Fabricated
This is the part that matters most. Not everything was wrong.
What was genuinely correct
- QSOT axioms are real science. The Linearity and Conditionability axioms, together with the uniqueness theorem connecting them to temporal quantum-state structure, are genuine mathematical results in Lie and Fullwood's QSOT work [3].
- The quantum-channel machinery was mostly correct. Kraus operators, CPTP evolution, and 2 x 2 density-matrix simulations used standard quantum information methods [1].
- Axiom 1 actually worked. The Linearity verification produced machine-precision agreement.
- TTM is a real algorithm. The Transfer Tensor Method itself is legitimate [2], even if our chosen system made the output uninformative.
- The coherence curve was internally consistent with the implemented model. Given boost_damping_channel, the L1-coherence decay was computed consistently with the implemented dynamics [12].
- The audit trail was real. The hash-chained trace.jsonl logging system functioned as a tamper-evident record.
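To make clear what a tamper-evident record does and does not guarantee, here is a minimal hash-chain sketch. It is an illustration of the general technique under our own assumed record layout, not the actual trace.jsonl format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first record

def _digest(prev_hash, payload):
    """Hash the payload together with the previous record's hash."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def append_record(chain, payload):
    """Append a record whose hash covers the payload and the previous hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev": prev_hash, "payload": payload,
                  "hash": _digest(prev_hash, payload)})
    return chain

def verify_chain(chain):
    """True if no record was altered, removed, or reordered after logging."""
    prev_hash = GENESIS
    for record in chain:
        if record["prev"] != prev_hash:
            return False
        if record["hash"] != _digest(prev_hash, record["payload"]):
            return False
        prev_hash = record["hash"]
    return True

chain = []
append_record(chain, {"step": 0, "kd_negativity_proxy": 0.0})
append_record(chain, {"step": 1, "nm_measure": 0.0})
print("chain valid:", verify_chain(chain))
```

A valid chain shows only that the log was not edited afterwards. A chain of placeholder zeros verifies exactly as well as a chain of real measurements.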
What was fabricated or unsupported
- KD negativity was never computed. The core QSOT observable was replaced with a mock artifact.
- The memory-kernel claim was unsupported. The implementation did not provide meaningful non-Markovian TTM analysis.
- The relativistic boost was not derived physics. The formula followed from analogy, not Lorentz-covariant quantum channel theory.
- The paper's Table 1 was not reproduced by the distributed dataset. raw_data.csv contained zeros throughout.
The correct and fabricated components were interwoven. The artifact was convincing because many parts were genuinely correct. The axiom verification worked. The coherence calculations worked. The audit logging worked. Those true components provided legitimacy cover for the parts that did not.
This is the defining property of High-Formality Slop: it is not uniformly false. It is selectively hollow, in exactly the places where hollowness is hardest to detect.
Part 9. Toward Legitimate AI-Native Research Implementation
The failure is diagnosable. It has a treatment. But the treatment cannot be a generic checklist floating above the failure. Every rule below comes from one place where our artifact broke.

1. Claim Boundary as a First-Class Artifact
This rule comes from the relativistic boost formula.
A claim boundary document for boost_damping_channel would have forced us to write:
"The derivation p_prime = 1 - (1 - p)^gamma is valid only as a phenomenological amplitude-damping ansatz under an assumed proper-time scaling of Lindblad rates. It is not derived from Lorentz-covariant quantum field theory."
That sentence would have changed the paper. It would have forced us to decide whether we were making a physics claim or presenting a phenomenological model.
QSOT Compiler v1.2.3 did not have that boundary.
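One way such a boundary could be made machine-readable is sketched below. The field names and the three-value status vocabulary are hypothetical choices for illustration; they are not the V2 schema.

```python
import json
from dataclasses import dataclass, asdict, field

ALLOWED_STATUS = {"derived", "phenomenological", "placeholder"}

@dataclass(frozen=True)
class ClaimBoundary:
    component: str
    claim: str
    status: str                                   # one of ALLOWED_STATUS
    assumptions: list = field(default_factory=list)
    not_claimed: list = field(default_factory=list)

def validate(boundary):
    """Reject records that leave the status or the assumptions unstated."""
    if boundary.status not in ALLOWED_STATUS:
        raise ValueError(f"unknown status: {boundary.status}")
    if boundary.status != "derived" and not boundary.assumptions:
        raise ValueError("non-derived claims must list their assumptions")
    return boundary

boost_boundary = validate(ClaimBoundary(
    component="boost_damping_channel",
    claim="p_prime = 1 - (1 - p)^gamma",
    status="phenomenological",
    assumptions=["proper-time scaling of Lindblad rates", "amplitude damping only"],
    not_claimed=["derivation from Lorentz-covariant quantum field theory"],
))
record_json = json.dumps(asdict(boost_boundary), indent=2)
print(record_json)
```

The useful property is the refusal path: a component marked anything other than "derived" cannot be recorded without stating what it assumes.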
2. The Adversarial Test Mandate
This rule comes from check_axiom2_conditionability.
The function accepted rhos and ignored them. A test suite existed. The test was green. The artifact looked validated. But the only adversarial test that mattered was absent: does the function fail when the supplied trajectory violates the claimed condition?
If a conditionability check can pass while ignoring the evolved states, the test does not test conditionability. It tests our willingness to accept a label.
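A minimal sketch of the missing adversarial cases follows. It assumes a simplified check that validates every supplied state and every channel output (Hermitian, unit trace, positive semidefinite). That is a necessary condition only, not the full Conditionability axiom, and all names here are hypothetical.

```python
import numpy as np

def is_valid_state(rho, tol=1e-8):
    """Hermitian, unit trace, positive semidefinite within tol."""
    rho = np.asarray(rho, dtype=complex)
    if not np.all(np.isfinite(rho)):
        return False
    if not np.allclose(rho, rho.conj().T, atol=tol):
        return False
    if abs(np.trace(rho).real - 1.0) > tol:
        return False
    return bool(np.min(np.linalg.eigvalsh(rho)) >= -tol)

def check_states_and_channels(rhos, channels, tol=1e-8):
    """Pass only if every supplied state is valid and stays valid under every channel."""
    for rho in rhos:
        if not is_valid_state(rho, tol):
            return False
        for apply_channel in channels:
            if not is_valid_state(apply_channel(rho), tol):
                return False
    return True

def dephase(rho, p=0.2):
    """Trace-preserving dephasing channel."""
    z = np.diag([1.0, -1.0])
    return (1 - p) * rho + p * (z @ rho @ z)

def leaky(rho):
    """Deliberately broken map that loses trace."""
    return 0.9 * rho

plus = np.full((2, 2), 0.5)
assert check_states_and_channels([plus], [dephase])

# Adversarial cases: each must fail, or the check is not checking anything.
assert not check_states_and_channels([2.0 * plus], [dephase])              # trace 2
assert not check_states_and_channels([np.full((2, 2), np.nan)], [dephase])  # not numbers
assert not check_states_and_channels([np.diag([1.5, -0.5])], [dephase])    # negative eigenvalue
assert not check_states_and_channels([plus], [leaky])                      # channel leaks trace
```

The design point is the ratio: one passing case and four cases that are required to fail. A check with no failing inputs has not been tested.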
3. Figure-Data Binding
This rule comes from Figure 1.
No publication figure should survive unless it can be regenerated from the distributed data by a named script in the repository.
This is not cosmetic reproducibility. It is a self-deception control. The moment a figure pipeline and a data pipeline separate, language begins stitching them back together even when the evidence does not.
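A minimal sketch of such a binding is shown below: the figure script records the hash of the data file and the values it plotted, and a verifier refuses the figure if either no longer matches. The manifest layout, the script name make_fig1.py, and the toy CSV are all hypothetical.

```python
import csv
import hashlib
import io

def sha256_text(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def load_column(csv_text, column):
    """Read one numeric column from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    return [float(row[column]) for row in reader]

def build_figure_manifest(csv_text, column, script_name):
    """Record which data file, column, and script a figure was produced from."""
    return {
        "script": script_name,
        "column": column,
        "data_sha256": sha256_text(csv_text),
        "values": load_column(csv_text, column),
    }

def verify_figure(manifest, csv_text, tol=1e-12):
    """True only if the plotted values can be regenerated from this exact data file."""
    if manifest["data_sha256"] != sha256_text(csv_text):
        return False
    regenerated = load_column(csv_text, manifest["column"])
    return len(regenerated) == len(manifest["values"]) and all(
        abs(a - b) <= tol for a, b in zip(regenerated, manifest["values"])
    )

# Toy data for illustration only; not values from the project.
data = "velocity, coherence\n0.0, 1.0\n0.5, 0.8\n0.9, 0.3\n"
manifest = build_figure_manifest(data, "coherence", "make_fig1.py")
print("figure bound to data:", verify_figure(manifest, data))
```

Under this scheme a figure whose values were produced by a separate path fails verification against the shipped CSV instead of being reconciled with it in prose.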
4. Measure-Claim Alignment
This rule comes from the entanglement/coherence swap.
Logarithmic Negativity, L1-norm coherence, and heuristic memory backflow are not interchangeable because they can all be placed under a phrase like "quantum resources." Each measure answers a different physical question. A single-qubit coherence curve cannot silently stand in for a bipartite entanglement result.
Every measure must name the physical question it answers. If the measure changes, the claim must change with it.
5. The Governance Layer
This rule comes from the whole artifact.
The Flamehaven Verification Ledger was not a response to abstract concern about AI quality. It was a response to this failure: the false KD artifact, the green conditionability test, the two-figure split, the boost ansatz, and the formal register that made all of it feel coherent.
The core insight is that in AI-native development, the primary human contribution has shifted from production to acceptance. We do not merely write the code. We inspect, refuse, correct, and bear final responsibility.
The human's primary tool is not only an editor or a compiler. It is a governance framework that makes the gap between claimed and actual behavior visible.
Part 10. QSOT V2 and the Ongoing Repair

The current QSOT V2 work exists because of this failure.
It is not a defense of QSOT v1.2.3. It is an attempt to separate what was real from what was falsely claimed, and to rebuild only on the parts that can survive explicit verification.
The repair is not one change.
It is a change in posture.
We are discarding the claim that code execution proves a new physical principle, the implied claim that KD negativity has been computed when it has not, and the habit of treating polished figures or formal reports as substitutes for reproducible data. We are also discarding the practice of presenting phenomenological channel assumptions as derived physics.
What remains is narrower but stronger: the real QSOT axiom structure, standard quantum-channel machinery, artifact-first reporting, and the governance lesson that claims must be bounded before results are interpreted.
The V2 work adds a Phase 0 Temporal-State Axiom Contract layer, required axiom checks, machine-readable claim boundaries, stress tests across multiple physical assumptions, and a phase-separated runner architecture. The purpose is not to prove the Time-as-State idea true. The purpose is to make clear what the system computes, what it merely simulates, and what it has no right to claim.
None of this makes the Time-as-State idea true.
What it may do, if completed successfully, is make our handling of the idea less dangerous.
Sophistication is not the opposite of slop. QSOT v1.2.3 was already sophisticated. The opposite of slop is traceable humility: knowing exactly what a system computes, what it merely simulates, and what it has no right to claim.
Conclusion
On December 23, 2025, a twelve-month ambition was compressed into six weeks of AI-assisted implementation, deposited on Zenodo with a DOI, and called done.
QSOT Compiler v1.2.3 had correct mathematics at its foundation, correct numerical implementation of some components, and fundamental hollowness at its core: KD quasiprobability was never computed; the relativistic boost formula was an analogy, not a derivation; the published figures could not be reproduced from the distributed data; and the axiom verification did not test the axioms it claimed to test.
The artifact passed every surface-level quality signal: CI/CD green, 82 percent test coverage, Zenodo DOI, LaTeX papers, Docker deployment, and a hash-chained audit trail.
It failed the one test that matters: can an independent expert reproduce the core claimed results from the distributed code and data?
The answer is no.
We call this High-Formality Slop because the formality is not accidental. It is what LLMs are especially good at producing.
Formal register, structured documentation, complete test scaffolding, and referenced citations are not separate from the hollow implementation. They are often produced by the same fluent process that generated the implementation in the first place.
The danger is not that the slop is unrecognizable. The danger is that it is too recognizable. It looks exactly like legitimate scientific software. That is why it is convincing, and why detecting it requires deliberate adversarial examination rather than casual reading.
The DOI 10.5281/zenodo.18035432 remains archived. We do not retract it. We document it here, as the record against which subsequent work is measured.
The Flamehaven Verification Ledger began the day we recognized what we had built. Approximately fifty experiments later, the tooling to detect, bound, and reduce this class of failure is now in active use. The lesson cost us six months. We are writing it down so it does not cost others the same.
References
[1] M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Information, Cambridge University Press (2000; 10th Anniversary Edition, 2010).
[2] F. A. Pollock, C. Rodriguez-Rosario, T. Frauenheim, M. Paternostro, K. Modi, Non-Markovian quantum processes: Complete framework and efficient characterization, Phys. Rev. A 97, 012127 (2018).
[3] S. H. Lie and J. Fullwood, Unique multipartite extension of quantum states over time, arXiv:2410.22630 (2024).
[4] J. Cotler, C.-M. Jian, X.-L. Qi, F. Wilczek, Superdensity Operators for Spacetime Quantum Mechanics, JHEP 09 (2018) 093.
[5] O. Oreshkov, F. Costa, C. Brukner, Quantum correlations with no causal order, Nature Communications 3, 1092 (2012).
[6] M. Lostaglio, A. Belenchia, A. Levy, S. Hernandez-Gomez, N. Fabbri, S. Gherardini, Kirkwood-Dirac quasiprobability approach to the statistics of incompatible observables, arXiv:2206.11783 (2022).
[7] N. Yunger Halpern, B. Swingle, J. Dressel, The quasiprobability behind the out-of-time-ordered correlator, Phys. Rev. A 97, 042105 (2018).
[8] S. Takagi, Vacuum Noise and Stress Induced by Uniform Acceleration, Progress of Theoretical Physics Supplement 88, 1-142 (1986).
[9] A. Peres and D. R. Terno, Quantum information and relativity theory, Reviews of Modern Physics 76, 93 (2004).
[10] P. M. Alsing, I. Fuentes-Schuller, R. B. Mann, T. E. Tessier, Entanglement of Dirac fields in non-inertial frames, Phys. Rev. A 74, 032326 (2006).
[11] Z. Li, Z. Li, Z. Guo, X. Ren, C. Huang, DeepCode: Open Agentic Coding, arXiv:2512.07921 (2025).
[12] T. Baumgratz, M. Cramer, M. B. Plenio, Quantifying Coherence, Phys. Rev. Lett. 113, 140401 (2014).