If you are running LLMs in production, you have probably seen this:

  • Models slowing down when reasoning gets deep

  • Fewer users per server than expected

  • GPU bills rising fast

The problem isn’t compute.

It’s memory.

When models “think” step by step, they store the attention keys and values for every reasoning token in GPU memory (the KV cache).

The longer they think, the larger this cache becomes.

Memory grows linearly with the number of tokens generated. Costs grow with it.
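To make the linear growth concrete, here is a rough back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) is an assumption chosen for illustration, not a figure from Nvidia's work, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV cache size for a decoder-only transformer.

    The factor 2 covers one key vector and one value vector per token,
    per layer, per KV head. All default shapes are illustrative assumptions.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


if __name__ == "__main__":
    for tokens in (1_024, 8_192, 32_768):
        gib = kv_cache_bytes(tokens) / 2**30
        print(f"{tokens:>6} tokens -> {gib:.3f} GiB per sequence")
```

Under these assumptions each token costs 128 KiB, so a single 32,768-token reasoning trace occupies 4 GiB before any other user is served.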

So teams face a trade-off:

Let the model think longer → better accuracy → higher infra cost

Limit thinking → lower cost → weaker performance

To solve this problem, Nvidia introduced something interesting: Dynamic Memory Sparsification (DMS).

Instead of keeping every token, the model learns what to keep and what to evict, intelligently.

Even better, it doesn’t delete things immediately. It briefly keeps “maybe useful” tokens before removing them. That avoids hurting reasoning quality.

How?

They add a small decision layer inside the attention mechanism. For every token, the model also predicts whether that token will matter for future reasoning.
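One way to picture that decision layer is a tiny scoring function that runs alongside attention. The sketch below assumes a linear projection followed by a sigmoid; the post does not say how DMS actually computes the decision, so `eviction_probability`, `decide_evictions`, the weights, and the 0.5 threshold are all hypothetical.

```python
import math


def eviction_probability(hidden, weights, bias=0.0):
    """Sigmoid of a linear projection: a hypothetical per-token eviction score."""
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def decide_evictions(hidden_states, weights, bias=0.0, threshold=0.5):
    """Return one boolean per token: True means 'mark this token for eviction'."""
    return [eviction_probability(h, weights, bias) > threshold
            for h in hidden_states]


if __name__ == "__main__":
    # Two toy token representations; the weights are made up for illustration.
    tokens = [[2.0, 0.0], [-2.0, 0.0]]
    print(decide_evictions(tokens, weights=[1.0, 0.0]))
```

The point is only that the decision is cheap: one extra score per token, not a second model.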

During training:

Most of the model is frozen.

Only this “memory policy” is trained.

If deleting a token changes the final answer, the model is penalized.

So it gradually learns:

“If removing this breaks my reasoning → keep it.”

“If nothing changes → it’s redundant.”
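The training signal described above can be sketched as a toy objective: compare the model's output with the full cache against its output with the sparsified cache, and add a penalty when too many tokens are kept. This is an illustration under stated assumptions, not the loss DMS uses; `memory_policy_loss`, the KL-divergence term, and the 1/8 target fraction are hypothetical choices.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def memory_policy_loss(full_logits, sparse_logits, kept_fraction,
                       target_fraction=0.125, budget_weight=1.0):
    """Toy loss: output drift plus a penalty for exceeding the memory budget."""
    # Drift: how far the answer moves once tokens are evicted.
    drift = kl_divergence(softmax(full_logits), softmax(sparse_logits))
    # Budget: only penalize keeping more than the target share of tokens.
    over_budget = max(0.0, kept_fraction - target_fraction)
    return drift + budget_weight * over_budget
```

If evicting tokens leaves the output unchanged and the budget is met, the loss is zero, which matches the “if nothing changes, it's redundant” rule. Because the rest of the model is frozen, only the memory policy's parameters would receive gradients from a loss like this.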

There’s also a safety mechanism called delayed eviction. Tokens aren’t deleted instantly; they’re kept briefly so the model can absorb any remaining signal before removal.
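Delayed eviction can be sketched as a cache with a grace period. In this toy version, a token marked for eviction stays visible for `delay` decoding steps (counting its own) and is then dropped. The class name, the fixed delay, and the string entries standing in for key/value tensors are assumptions for illustration.

```python
from collections import deque


class DelayedEvictionCache:
    """Toy KV cache: tokens marked for eviction survive `delay` steps."""

    def __init__(self, delay):
        self.delay = delay
        self.live = {}           # position -> cached entry (stand-in for K/V)
        self.pending = deque()   # (position, step at which it is removed)
        self.step = 0

    def append(self, entry, evict):
        """Add one token; `evict` is the memory policy's decision for it."""
        pos = self.step
        self.live[pos] = entry
        if evict:
            self.pending.append((pos, pos + self.delay))
        # Drop tokens whose grace period has expired.
        while self.pending and self.pending[0][1] <= self.step:
            old_pos, _ = self.pending.popleft()
            del self.live[old_pos]
        self.step += 1

    def positions(self):
        """Positions still held in the cache, in order."""
        return sorted(self.live)


if __name__ == "__main__":
    cache = DelayedEvictionCache(delay=2)
    cache.append("a", evict=True)    # marked, but kept for now
    cache.append("b", evict=False)
    print(cache.positions())         # token 0 is still visible
    cache.append("c", evict=False)
    print(cache.positions())         # token 0 has now been removed
```

The grace period is what lets later tokens attend to a doomed token at least once before it disappears.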

The impact?

• Up to 8× lower memory cost

• Up to 5× higher throughput

• In some cases, better reasoning under the same budget
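To see what an 8× memory reduction could mean for capacity, here is a simple sketch. The 40 GiB cache budget and 4 GiB per-sequence cache are assumed numbers, and `max_concurrent_sequences` is a hypothetical helper. It covers memory-bound capacity only and says nothing about the separate throughput figure.

```python
def max_concurrent_sequences(budget_bytes, bytes_per_sequence, compression=1.0):
    """How many sequences fit when each cache shrinks by `compression`x."""
    compressed = bytes_per_sequence / compression
    return int(budget_bytes // compressed)


if __name__ == "__main__":
    GIB = 2**30
    budget = 40 * GIB        # assumed GPU memory left over for KV cache
    per_sequence = 4 * GIB   # assumed cache size of one long reasoning trace
    print(max_concurrent_sequences(budget, per_sequence))
    print(max_concurrent_sequences(budget, per_sequence, compression=8))
```

Since 8× is a best-case figure (“up to”), treat the second number as a ceiling rather than a forecast.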

As AI systems move from simple chat to multi-step reasoning agents, memory, not compute, may become your biggest constraint.

Worth watching closely.
