Multimodal AI has a hidden problem.

Images → one tokenizer

Videos → another

3D → completely different setup

And it gets worse:

  • Models that generate visuals don’t really understand them

  • Models that understand visuals can’t generate them well

So instead of one intelligent system,

we end up with a stack of disconnected capabilities.

Apple is taking a very different approach with a new model: AToken.

Instead of adding more pieces, it removes them.

  • One tokenizer

  • One encoder

  • Works across images, videos, and 3D

The core idea:

Treat all visual data in a unified format.

Images → (x, y)

Videos → (t, x, y)

3D → (x, y, z)

Everything becomes part of a single 4D token space: (t, x, y, z).
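
To make that concrete, here is a minimal sketch (my illustration, not Apple's AToken code) of how images, videos, and 3D grids can all be expressed as positions in one shared (t, x, y, z) coordinate space, so a single tokenizer can treat them uniformly:

```python
# Minimal sketch of the unified-coordinate idea (illustrative, not AToken's code).
# Every visual element gets a (t, x, y, z) position, so one tokenizer can embed
# images, videos, and 3D data in the same 4D token space.

import numpy as np

def image_coords(h, w):
    # Images live on a 2D grid: vary over (x, y), with t = 0 and z = 0.
    y, x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    t = np.zeros_like(x)
    z = np.zeros_like(x)
    return np.stack([t, x, y, z], axis=-1).reshape(-1, 4)

def video_coords(frames, h, w):
    # Videos add a time axis: vary over (t, x, y), with z = 0.
    t, y, x = np.meshgrid(np.arange(frames), np.arange(h), np.arange(w), indexing="ij")
    z = np.zeros_like(x)
    return np.stack([t, x, y, z], axis=-1).reshape(-1, 4)

def voxel_coords(d, h, w):
    # Static 3D data varies over (x, y, z), with t = 0.
    z, y, x = np.meshgrid(np.arange(d), np.arange(h), np.arange(w), indexing="ij")
    t = np.zeros_like(x)
    return np.stack([t, x, y, z], axis=-1).reshape(-1, 4)

# Each row is a (t, x, y, z) position; the same encoder can attach a positional
# embedding to these rows regardless of which modality produced them.
print(image_coords(2, 2).shape)      # (4, 4)  -> 4 image tokens
print(video_coords(2, 2, 2).shape)   # (8, 4)  -> 8 video tokens
print(voxel_coords(2, 2, 2).shape)   # (8, 4)  -> 8 3D tokens
```

The point of the sketch: once every modality is just a set of 4D positions, the rest of the model (encoder, decoder, attention) doesn't need to know which medium the tokens came from.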

So the same model can:

  • Understand

  • Generate

  • Reconstruct

Across all formats.

And the real unlock:

Data leverage.

We have massive image datasets.

But very limited video and 3D data.

With a shared model:

→ Learning transfers across modalities

→ Less data needed overall

→ Faster capability growth

This is exactly what happened with LLMs.

One tokenizer → text, code, conversations, everything.

Now we’re seeing the same shift in vision.

From:

“different models for different media”

To:

one model that understands the visual world.
