Can we make LLMs generate text faster?

Leader posted Originally published at www.linkedin.com 1 min read

The problem

Most AI language models write text one word (more precisely, one token) at a time, and each new word depends on the words already written.

That keeps answers good, but makes them slow.
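To make the bottleneck concrete, here is a toy sketch of word-by-word generation. The `next_word` function is a hypothetical stand-in for a model call, not a real API. The only point is that each step needs the output of the previous step, so a six-word sentence costs six calls made one after another.

```python
# Toy illustration only: a fixed sentence stands in for a real model.
TARGET = ["The", "cat", "sat", "on", "the", "mat."]

def next_word(prefix):
    """Hypothetical model call: return the next word given the words so far,
    or None when the sentence is complete."""
    if len(prefix) < len(TARGET):
        return TARGET[len(prefix)]
    return None

def generate_sequentially():
    """Generate one word per step; each step waits for the previous one."""
    words = []
    steps = 0
    word = next_word(words)
    while word is not None:
        words.append(word)
        steps += 1
        word = next_word(words)  # cannot start until `words` is updated
    return " ".join(words), steps

text, steps = generate_sequentially()
print(text)   # The cat sat on the mat.
print(steps)  # 6
```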

Some models try to write everything at once to be faster.

But then:

  • they lose memory efficiency

  • the text is less coherent

  • compute cost goes up

So today the choice is fast or good, not both.

The solution: ReFusion

ReFusion doesn’t work word by word.

It works in small chunks of text, called slots.

How it works:

  • First, the model plans which chunks can be written together

  • Then it writes those chunks in parallel

For example, suppose the sentence to write is:

“The cat sat on the mat.”

Step 1: Plan slots

The model decides:

Slot 1: “The cat”

Slot 2: “sat on”

Slot 3: “the mat”

Step 2: Write slots in parallel

All three are generated at the same time:

Slot 1 → “The cat”

Slot 2 → “sat on”

Slot 3 → “the mat”

Step 3: Combine

Final sentence:

“The cat sat on the mat.”
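The three steps above can be sketched in a few lines of Python. This is a toy under stated assumptions, not ReFusion's actual implementation: `plan_slots` and `write_slot` are hypothetical names, a lookup table stands in for the model, and a thread pool stands in for real parallel decoding on a GPU.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the model: the text each slot should contain.
TOY_SLOT_TEXT = {0: "The cat", 1: "sat on", 2: "the mat."}

def plan_slots(num_slots):
    """Step 1 (toy planner): decide which slots can be written together.
    Here every slot is assumed independent enough to write at once."""
    return list(range(num_slots))

def write_slot(slot_id):
    """Step 2 (toy writer): produce the text for a single slot."""
    return TOY_SLOT_TEXT[slot_id]

def generate(num_slots):
    planned = plan_slots(num_slots)
    with ThreadPoolExecutor() as pool:
        # All planned slots are submitted together; map() returns results
        # in the same order as the input, so slot order is preserved.
        texts = list(pool.map(write_slot, planned))
    # Step 3: combine the slots into the final sentence.
    return " ".join(texts)

sentence = generate(3)
print(sentence)  # The cat sat on the mat.
```

In this toy, every slot is planned in a single round. A real planner would presumably be more selective, writing only the slots that do not depend heavily on text that has not been written yet.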

The result

  • Faster than the strong word-by-word models we use today

  • Quality stays almost the same

Do you think decoding speed, rather than model size, might be the next big frontier?
