You know that moment when your test suite goes green but the feature is clearly broken? Multiply that by an entire field, stretch it over decades, and you've got what dendroclimatologists call the divergence problem.
The benchmark that quietly expired
HumanEval — 164 hand-written Python problems from 2021 — became the benchmark for LLM coding ability. For three years, pass@1 scores tracked real capability improvements. Then frontier models started hitting 90%+ and the benchmark was quietly dropped from comparisons.
How much of that 90% was real?
Bradbury and More (2024) built HumanEval-T — same problems, different enough to prevent memorization. Every model dropped 5 to 14 percentage points. Qwen-2.5-Coder now explicitly decontaminates against HumanEval using 10-gram collision detection. A major lab admitting in their training pipeline that the benchmark was compromised.
The proxy expired. Nobody printed the warranty.
This is actually a thousand-year-old problem
For a thousand years, tree rings at far-northern latitudes tracked temperature faithfully. Wider rings meant warmer summers. Dendroclimatologists built entire climate reconstructions on this relationship — centuries of temperature data, cited by the IPCC.
Then around 1960, the trees stopped matching the thermometers. Ring widths diverged downward while temperatures went up. Nobody noticed for 35 years.
The field has six plausible explanations. None definitively rules out the others. D'Arrigo's 2008 review laid them all out: drought stress, global dimming, UV-B damage, snowmelt timing shifts, survivorship bias in sample selection, and statistical artifacts. The forensics are permanently messy — because when a proxy fails, multiple causal pathways break at once.
The pattern shows up in AI evaluation too
GPT-4 passed theory-of-mind tests (the Sally-Anne false-belief task) at 75%. Looked like emergent reasoning.
Then Ullman (2023) changed one thing: made the container transparent instead of opaque. GPT-3.5 dropped to 6%. SCALPEL (2024) found GPT-4 at 20.35% on the variant.
Here's the kicker: add one explicit line saying the character "recognizes" the contents, and GPT-4 jumped back to 89.64%.
The capability existed. The proxy just lost the ability to measure it.
The five-phase warranty
Every proxy failure follows the same arc:
- Calibration — proxy tracks reality
- Reliance — you build systems on it
- Silent divergence — proxy quietly decouples
- Hindsight discovery — a second instrument reveals the gap
- Overdetermined forensics — multiple explanations, none clean
What you can actually do
Rotate your benchmarks. LiveCodeBench (ICLR 2025) uses rolling monthly updates so models can't train on the test set. If your internal benchmark hasn't changed in two years, treat that as a yellow flag.
Treat stability as suspicious. A metric that hasn't moved could be saturated, gamed, or decoupled. Stable ≠ reliable.
Don't scale your way out. Pan et al. (2022) showed larger models get higher proxy rewards but lower true rewards. More optimization against a broken proxy makes the problem worse.
Every metric you rely on — test coverage, sprint velocity, interview rubrics, code review approval rates — was calibrated under conditions that will not hold forever. The trees kept faith for a thousand years. When they stopped, nobody heard it happen.
Sources: D'Arrigo et al. (2008); Bradbury & More (2024), arXiv:2412.01526; Pi et al. (2024), arXiv:2406.14737; Kosinski (2024), PNAS; Ullman (2023); Pan et al. (2022); LiveCodeBench (ICLR 2025); "Dead rats, dopamine, performance metrics, and peacock tails," BBS (2023).
The essay's prescription: build a second instrument. Chain of Consciousness applies this to agent systems — every action anchored to a verifiable external record, so your audit trail doesn't depend on the agent's own self-report. pip install chain-of-consciousness