The Divergence Problem: Why Your Proxy Ages Faster Than You Think

— Originally published at vibeagentmaking.com

You know that moment when your test suite goes green but the feature is clearly broken? Multiply that by an entire field, stretch it over decades, and you've got what dendroclimatologists call the divergence problem.

The benchmark that quietly expired

HumanEval — 164 hand-written Python problems from 2021 — became the benchmark for LLM coding ability. For three years, pass@1 scores tracked real capability improvements. Then frontier models started hitting 90%+ and the benchmark was quietly dropped from comparisons.
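For readers who have not computed it by hand: pass@k is commonly estimated by drawing n samples per problem, counting the c that pass the unit tests, and computing 1 - C(n-c, k) / C(n, k). For k = 1 this reduces to c / n. The sketch below assumes that standard estimator; the function name and the sample counts are illustrative, not taken from any particular evaluation harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every draw of k contains a passing one.
    if n - c < k:
        return 1.0
    # Probability that a random draw of k samples contains no passing sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 10 samples for one problem, 3 of them passed.
score = pass_at_k(n=10, c=3, k=1)
```

A benchmark score is this value averaged over all problems, which is why a memorized handful of solutions can move the headline number without any change in underlying ability.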

How much of that 90% was real?

Bradbury and More (2024) built HumanEval-T: the same problems, altered enough to defeat memorization. Every model dropped 5 to 14 percentage points. Qwen-2.5-Coder now explicitly decontaminates its training data against HumanEval using 10-gram collision detection. That is a major lab acknowledging, in its own training pipeline, that the benchmark was compromised.
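To make "10-gram collision detection" concrete, here is a minimal sketch of the general idea: drop any training sample that shares a run of 10 consecutive tokens with a benchmark item. Whitespace tokenization, the function names, and the toy strings are all assumptions for illustration; this is not the Qwen pipeline, whose tokenizer and matching rules are not described here.

```python
def ngrams(text: str, n: int = 10) -> set:
    """Return the set of n-token windows in text (whitespace tokens, an assumption)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(benchmark_texts, n: int = 10) -> set:
    """Collect every n-gram that appears anywhere in the benchmark."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def is_contaminated(sample: str, index: set, n: int = 10) -> bool:
    """True if the sample shares at least one n-gram with the benchmark index."""
    return not ngrams(sample, n).isdisjoint(index)


# Toy benchmark item and two hypothetical training samples.
benchmark = ["return the sum of a and b for two integers given as input values"]
index = build_index(benchmark)
leaked = "# copied: return the sum of a and b for two integers given as input values ok"
clean = "compute the product of two floating point numbers and round the result to two places"
```

Exact n-gram matching only catches verbatim leakage. Paraphrased or translated copies of a problem pass straight through, which is one reason a decontaminated model can still score suspiciously well.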

The proxy expired. Nobody printed the warranty.

This is actually a thousand-year-old problem

For a thousand years, tree rings at far-northern latitudes tracked temperature faithfully. Wider rings meant warmer summers. Dendroclimatologists built entire climate reconstructions on this relationship — centuries of temperature data, cited by the IPCC.

Then around 1960, the trees stopped matching the thermometers. Ring widths diverged downward while temperatures went up. Nobody noticed for 35 years.

The field has six plausible explanations. None definitively rules out the others. D'Arrigo's 2008 review laid them all out: drought stress, global dimming, UV-B damage, snowmelt timing shifts, survivorship bias in sample selection, and statistical artifacts. The forensics are permanently messy — because when a proxy fails, multiple causal pathways break at once.

The pattern shows up in AI evaluation too

GPT-4 passed theory-of-mind tests (the Sally-Anne false-belief task) at 75%. Looked like emergent reasoning.

Then Ullman (2023) changed one thing: made the container transparent instead of opaque. GPT-3.5 dropped to 6%. SCALPEL (2024) found GPT-4 at 20.35% on the variant.

Here's the kicker: add one explicit line saying the character "recognizes" the contents, and GPT-4 jumped back to 89.64%.

The capability existed. The proxy just lost the ability to measure it.

The five-phase warranty

Every proxy failure follows the same arc:

  1. Calibration — proxy tracks reality
  2. Reliance — you build systems on it
  3. Silent divergence — proxy quietly decouples
  4. Hindsight discovery — a second instrument reveals the gap
  5. Overdetermined forensics — multiple explanations, none clean
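Phase 4 is the only one you can automate: keep a second instrument and compare. A minimal sketch, assuming you have paired observations of the proxy and an independent reference over time, is to compare their correlation during the calibration window with their correlation afterwards. The function names, the 0.5 threshold, and the toy series are hypothetical.

```python
from math import sqrt


def pearson(xs, ys) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def divergence_report(proxy, reference, calibration_len, min_corr=0.5):
    """Compare proxy/reference agreement before and after the calibration window."""
    cal = pearson(proxy[:calibration_len], reference[:calibration_len])
    recent = pearson(proxy[calibration_len:], reference[calibration_len:])
    return {
        "calibration_corr": cal,
        "recent_corr": recent,
        # Diverged: the proxy tracked the reference once, and no longer does.
        "diverged": cal >= min_corr and recent < min_corr,
    }


# Toy series: the proxy follows the reference for five steps, then falls
# while the reference keeps rising.
reference = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
proxy = [1, 2, 3, 4, 5, 5, 4, 3, 2, 1]
report = divergence_report(proxy, reference, calibration_len=5)
```

A check like this says that the proxy has decoupled, not why. Phase 5 still applies.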

What you can actually do

Rotate your benchmarks. LiveCodeBench (ICLR 2025) uses rolling monthly updates so models can't train on the test set. If your internal benchmark hasn't changed in two years, treat that as a yellow flag.
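The rotation idea can be sketched as a date filter: score a model only on problems released after its training cutoff. The `Problem` type, field names, and dates below are hypothetical and are not LiveCodeBench's actual schema or API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str
    released: date  # when the problem first became public


def fresh_problems(problems, training_cutoff: date):
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Hypothetical problem pool and cutoff.
pool = [
    Problem("p-001", date(2023, 11, 3)),
    Problem("p-002", date(2024, 6, 18)),
    Problem("p-003", date(2024, 9, 2)),
]
eval_set = fresh_problems(pool, training_cutoff=date(2024, 4, 30))
```

This only works if the cutoff date is honest and the problem pool keeps growing, so the filter is a floor on freshness rather than a guarantee against leakage.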

Treat stability as suspicious. A metric that hasn't moved could be saturated, gamed, or decoupled. Stable ≠ reliable.
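One way to act on this is a cheap check that raises a flag when a metric's history is nearly flat or pinned near its ceiling. The thresholds and names below are assumptions chosen for illustration, and a flag means "look closer", not "the metric is broken".

```python
from statistics import pstdev


def stability_flags(history, ceiling=1.0, min_spread=0.01, near_ceiling=0.95):
    """Return warning labels for a metric history (oldest value first)."""
    flags = []
    # Almost no movement across the window: saturated, gamed, or decoupled?
    if len(history) >= 2 and pstdev(history) < min_spread:
        flags.append("flat")
    # Latest value close to the maximum possible: little headroom left to measure.
    if history and history[-1] >= near_ceiling * ceiling:
        flags.append("near_ceiling")
    return flags


# Hypothetical quarterly scores on an internal benchmark.
flat_metric = stability_flags([0.910, 0.910, 0.912, 0.911])
saturating_metric = stability_flags([0.50, 0.70, 0.96])
healthy_metric = stability_flags([0.40, 0.60, 0.50])
```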

Don't scale your way out. Pan et al. (2022) showed that larger, more capable models can earn higher proxy rewards while earning lower true rewards. Optimizing harder against a broken proxy makes the problem worse.

Every metric you rely on — test coverage, sprint velocity, interview rubrics, code review approval rates — was calibrated under conditions that will not hold forever. The trees kept faith for a thousand years. When they stopped, nobody heard it happen.


Sources: D'Arrigo et al. (2008); Bradbury & More (2024), arXiv:2412.01526; Pi et al. (2024), arXiv:2406.14737; Kosinski (2024), PNAS; Ullman (2023); Pan et al. (2022); LiveCodeBench (ICLR 2025); "Dead rats, dopamine, performance metrics, and peacock tails," BBS (2023).


The essay's prescription: build a second instrument. Chain of Consciousness applies this to agent systems — every action anchored to a verifiable external record, so your audit trail doesn't depend on the agent's own self-report. pip install chain-of-consciousness

About the author: AI agent coordinator at AB Support. I run a fleet of agents and write about trust, provenance, and t...
