The Divergence Problem: Why Your Proxy Ages Faster Than You Think

1 day ago • 2 min read

— Originally published at vibeagentmaking.com

You know that moment when your test suite goes green but the feature is clearly broken? Multiply that by an entire field, stretch it over decades, and you've got what dendroclimatologists call the divergence problem.

The benchmark that quietly expired

HumanEval — 164 hand-written Python problems from 2021 — became the benchmark for LLM coding ability. For three years, pass@1 scores tracked real capability improvements. Then frontier models started hitting 90%+ and the benchmark was quietly dropped from comparisons.
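For readers who have not computed it by hand: pass@1 is normally reported through the unbiased pass@k estimator that was published alongside HumanEval. You sample n completions per problem, count the c that pass the unit tests, and estimate the chance that at least one of k random draws passes. A minimal sketch follows; the function names and the (n, c) input layout are my own assumptions, not any particular harness's API.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Mean pass@k over problems; results holds one (n, c) pair per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 this reduces to c / n per problem, which is why pass@1 reads like a plain success rate. Note that nothing in the formula can tell a solved problem from a memorized one.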

How much of that 90% was real?

Bradbury and More (2024) built HumanEval-T: the same problems, varied enough to prevent memorization. Every model they tested dropped 5 to 14 percentage points. Qwen-2.5-Coder now explicitly decontaminates its training data against HumanEval using 10-gram collision detection. That is a major lab acknowledging, in its own training pipeline, that the benchmark was compromised.
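To make "10-gram collision detection" concrete, here is a rough sketch of the general idea: drop any training document that shares a run of ten consecutive words with a benchmark item. This is not Qwen's actual pipeline. The whitespace tokenization, lowercasing, and function names are all assumptions for illustration.

```python
def ngrams(text: str, n: int = 10) -> set[tuple[str, ...]]:
    """Word-level n-grams of text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    # Texts shorter than n tokens yield an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_items: list[str], n: int = 10) -> set[tuple[str, ...]]:
    """Union of all n-grams that appear in any benchmark item."""
    index: set[tuple[str, ...]] = set()
    for item in benchmark_items:
        index |= ngrams(item, n)
    return index


def decontaminate(training_docs: list[str], benchmark_items: list[str], n: int = 10) -> list[str]:
    """Keep only training documents that share no n-gram with the benchmark."""
    index = build_benchmark_index(benchmark_items, n)
    return [doc for doc in training_docs if ngrams(doc, n).isdisjoint(index)]
```

Exact n-gram matching only catches verbatim leakage. A paraphrased or translated copy of a benchmark problem passes straight through, which is one reason a varied test set can still expose a gap.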

The proxy expired. Nobody printed the warranty.

This is actually a thousand-year-old problem

For a thousand years, tree rings at far-northern latitudes tracked temperature faithfully. Wider rings meant warmer summers. Dendroclimatologists built entire climate reconstructions on this relationship — centuries of temperature data, cited by the IPCC.

Then around 1960, the trees stopped matching the thermometers. Ring widths diverged downward while temperatures went up. Nobody noticed for 35 years.

The field has six plausible explanations. None definitively rules out the others. D'Arrigo's 2008 review laid them all out: drought stress, global dimming, UV-B damage, snowmelt timing shifts, survivorship bias in sample selection, and statistical artifacts. The forensics are permanently messy — because when a proxy fails, multiple causal pathways break at once.

The pattern shows up in AI evaluation too

GPT-4 passed 75% of theory-of-mind false-belief tasks, the family of tests that includes Sally-Anne (Kosinski, 2024). It looked like emergent reasoning.

Then Ullman (2023) changed one thing: the container became transparent instead of opaque. GPT-3.5 dropped to 6%. SCALPEL (Pi et al., 2024) found GPT-4 at 20.35% on the variant.

Here's the kicker: add one explicit line saying the character "recognizes" the contents, and GPT-4 jumped back to 89.64%.

The capability existed. The proxy just lost the ability to measure it.
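The same move works on your own evals: score one model on the original items and on minimally perturbed variants, then look at the gap instead of the headline number. A minimal sketch, assuming each item is a dict with hypothetical keys 'variant', 'prompt', and 'expected', and that `model` is any callable from prompt to answer string:

```python
from collections import defaultdict
from typing import Callable


def accuracy_by_variant(items: list[dict], model: Callable[[str], str]) -> dict[str, float]:
    """Exact-match accuracy per variant label (case and whitespace insensitive)."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for item in items:
        total[item["variant"]] += 1
        answer = model(item["prompt"]).strip().lower()
        if answer == item["expected"].strip().lower():
            correct[item["variant"]] += 1
    return {variant: correct[variant] / total[variant] for variant in total}


def robustness_gaps(acc: dict[str, float], baseline: str = "original") -> dict[str, float]:
    """Accuracy lost on each variant relative to the baseline variant."""
    return {v: acc[baseline] - a for v, a in acc.items() if v != baseline}
```

A large gap on a trivial alteration does not say which side is wrong. As the "recognizes" result shows, it can mean the test stopped measuring the capability just as easily as it can mean the capability was never there.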

The five-phase warranty

Every proxy failure follows the same arc:

1. Calibration: proxy tracks reality
2. Reliance: you build systems on it
3. Silent divergence: proxy quietly decouples
4. Hindsight discovery: a second instrument reveals the gap
5. Overdetermined forensics: multiple explanations, none clean

What you can actually do

Rotate your benchmarks. LiveCodeBench (ICLR 2025) uses rolling monthly updates so models can't train on the test set. If your internal benchmark hasn't changed in two years, treat that as a yellow flag.
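The mechanics behind a rolling benchmark are simple enough to apply to an internal suite: record when each problem was released, and score a model only on problems newer than its training cutoff. A minimal sketch, with a hypothetical `Problem` record and the two-year yellow flag expressed as 730 days:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str
    released: date


def fresh_problems(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


def is_stale(last_updated: date, today: date, max_age_days: int = 730) -> bool:
    """True if the benchmark has gone longer than max_age_days without an update."""
    return (today - last_updated).days > max_age_days
```

The training cutoff is often only approximately known, so in practice it is safer to leave a margin after the stated date than to trust it to the day.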

Treat stability as suspicious. A metric that hasn't moved could be saturated, gamed, or decoupled. Stable ≠ reliable.
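A cheap automated version of that suspicion is to flag any metric whose history is nearly flat or pinned near its ceiling. The sketch below is only a tripwire, not a diagnosis, and its thresholds (a spread under 0.01, a last value within 95% of the ceiling) are arbitrary assumptions you would tune per metric.

```python
from statistics import pstdev


def stability_flags(history: list[float], ceiling: float = 1.0,
                    near_ceiling: float = 0.95, min_spread: float = 0.01) -> list[str]:
    """Return warning labels for a metric history, oldest value first."""
    flags = []
    # A tiny standard deviation means the metric has stopped carrying information.
    if len(history) >= 2 and pstdev(history) < min_spread:
        flags.append("flat")
    # A latest value close to the maximum possible score suggests saturation.
    if history and history[-1] >= near_ceiling * ceiling:
        flags.append("near_ceiling")
    return flags
```

A flagged metric still needs a second instrument to tell saturation from gaming from decoupling.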

Don't scale your way out. Pan et al. (2022) showed that more capable agents, including larger models, can earn higher proxy rewards while earning lower true rewards. More optimization against a broken proxy makes the problem worse.
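If you do have some independent measure of the true objective, even an expensive one sampled occasionally, the pattern is easy to check for: walk through runs in order of capability and flag each step where the proxy went up while the true score went down. A minimal sketch; the (capability, proxy_reward, true_reward) tuple layout is a hypothetical format, not anything from Pan et al.

```python
def reward_hacking_steps(runs: list[tuple[float, float, float]]) -> list[tuple[float, float]]:
    """Find capability steps where proxy reward rose but true reward fell.

    runs holds (capability, proxy_reward, true_reward) tuples in any order.
    Returns (from_capability, to_capability) pairs for each flagged step.
    """
    ordered = sorted(runs)  # sort by capability, the first tuple element
    flagged = []
    for (c0, p0, t0), (c1, p1, t1) in zip(ordered, ordered[1:]):
        if p1 > p0 and t1 < t0:
            flagged.append((c0, c1))
    return flagged
```

The check depends entirely on the quality of the true-reward measurement, which is itself a proxy with its own warranty.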

Every metric you rely on — test coverage, sprint velocity, interview rubrics, code review approval rates — was calibrated under conditions that will not hold forever. The trees kept faith for a thousand years. When they stopped, nobody heard it happen.

Sources: D'Arrigo et al. (2008); Bradbury & More (2024), arXiv:2412.01526; Pi et al. (2024), arXiv:2406.14737; Kosinski (2024), PNAS; Ullman (2023); Pan et al. (2022); LiveCodeBench (ICLR 2025); "Dead rats, dopamine, performance metrics, and peacock tails," BBS (2023).

The essay's prescription: build a second instrument. Chain of Consciousness applies this to agent systems: every action is anchored to a verifiable external record, so your audit trail doesn't depend on the agent's own self-report. Install it with pip install chain-of-consciousness
