Multimodal AI has a hidden problem.

Images → one tokenizer

Videos → another

3D → completely different setup

And it gets worse:

  • Models that generate visuals don’t really understand them

  • Models that understand visuals can’t generate them well

So instead of one intelligent system,

we end up with a stack of disconnected capabilities.

Apple is taking a very different approach with a new model: AToken.

Instead of adding more pieces, it removes them.

  • One tokenizer

  • One encoder

  • Works across images, videos, and 3D

The core idea:

Treat all visual data in a unified format.

Images → (x, y)

Videos → (t, x, y)

3D → (x, y, z)

Everything becomes part of a single 4D token space: (t, x, y, z).
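
To make that concrete, here is a minimal sketch (my illustration, not Apple's AToken code) of how images, videos, and 3D grids can all be expressed as positions in one shared (t, x, y, z) coordinate space, so a single tokenizer can treat them uniformly:

```python
# Minimal sketch of the unified-coordinate idea (illustrative, not AToken's code).
# Every visual element gets a (t, x, y, z) position, so one tokenizer can embed
# images, videos, and 3D data in the same 4D token space.

import numpy as np

def image_coords(h, w):
    # Images live on a 2D grid: vary over (x, y), with t = 0 and z = 0.
    y, x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    t = np.zeros_like(x)
    z = np.zeros_like(x)
    return np.stack([t, x, y, z], axis=-1).reshape(-1, 4)

def video_coords(frames, h, w):
    # Videos add a time axis: vary over (t, x, y), with z = 0.
    t, y, x = np.meshgrid(np.arange(frames), np.arange(h), np.arange(w), indexing="ij")
    z = np.zeros_like(x)
    return np.stack([t, x, y, z], axis=-1).reshape(-1, 4)

def voxel_coords(d, h, w):
    # Static 3D data varies over (x, y, z), with t = 0.
    z, y, x = np.meshgrid(np.arange(d), np.arange(h), np.arange(w), indexing="ij")
    t = np.zeros_like(x)
    return np.stack([t, x, y, z], axis=-1).reshape(-1, 4)

# Each row is a (t, x, y, z) position; the same encoder can attach a positional
# embedding to these rows regardless of which modality produced them.
print(image_coords(2, 2).shape)      # (4, 4)  -> 4 image tokens
print(video_coords(2, 2, 2).shape)   # (8, 4)  -> 8 video tokens
print(voxel_coords(2, 2, 2).shape)   # (8, 4)  -> 8 3D tokens
```

The point of the sketch: once every modality is just a set of 4D positions, the rest of the model (encoder, decoder, attention) doesn't need to know which medium the tokens came from.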

So the same model can:

  • Understand

  • Generate

  • Reconstruct

Across all formats.

And the real unlock:

Data leverage.

We have massive image datasets.

But very limited video and 3D data.

With a shared model:

→ Learning transfers across modalities

→ Less data needed overall

→ Faster capability growth

This is exactly what happened with LLMs.

One tokenizer → text, code, conversations, everything.

Now we’re seeing the same shift in vision.

From:

“different models for different media”

To:

one model that understands the visual world.
