How Arcitecta Built a Data Platform That Scales to Trillions of Files
Most data management platforms break down at petabyte scale. Arcitecta's Mediaflux doesn't.
The Australian company has been building data infrastructure since 1998. Their current claim sounds impossible: a single customer is managing over one trillion files in a single namespace. Not across multiple systems. One unified view.
I met with the Arcitecta team during the 64th IT Press Tour in New York to understand how they built this.
Built from scratch
Arcitecta writes everything themselves. Their XODB database. Their protocols. Their file systems. All first-principles engineering.
XODB is an XML-encoded object database that sits at the core of Mediaflux. It was created in 2010 and supports objects, geospatial data, time series, and now vector embeddings.
This matters because most platforms keep their metadata catalog separate from the underlying file system index. XODB doesn't. Everything lives in one database, which makes finding relationships between data much faster.
The database footprint is minimal. For one deployment managing hundreds of billions of files and heading toward trillions, they achieved an average of 75 bytes per node. That's two orders of magnitude denser than typical file systems.
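To put that figure in perspective, here's a rough back-of-envelope calculation. The per-node figure and the one-trillion-file count come from the claims above; the "typical" comparison value is simply the two-orders-of-magnitude gap worked backwards, not a measured number from Arcitecta.

```python
# Back-of-envelope check on the density figure. Only the 75-bytes-per-node
# average and the trillion-file scale are reported; the "typical" value is
# an assumed 100x comparison point.
files = 1_000_000_000_000      # one trillion files
mediaflux_bytes = 75           # reported average metadata per node
typical_bytes = 7_500          # assumed ~100x larger, for comparison only

print(f"Mediaflux metadata footprint: {files * mediaflux_bytes / 1e12:.0f} TB")
print(f"Typical footprint at 100x:    {files * typical_bytes / 1e12:,.0f} TB")
# Mediaflux metadata footprint: 75 TB
# Typical footprint at 100x:    7,500 TB
```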
Protocol agnostic by design
Mediaflux supports NFS, SMB, S3, and SFTP. But here's what makes it different: they wrote all these protocols themselves.
Why does this matter? Because you're not locked into any single vendor's implementation. Princeton University uses Mediaflux to manage 200 petabytes of research data across Dell PowerScale, IBM Spectrum Scale, Dell ECS, IBM Cloud Object Storage, and IBM Diamondback tape libraries. All accessible through the same interface.
The IBM Diamondback integration is particularly interesting. IBM's tape system didn't have native S3 support. Arcitecta built it. Now, researchers at Princeton can access tape-archived data through standard S3 calls.
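To make that concrete, here's roughly what the researcher-facing side of that access pattern looks like with the standard boto3 S3 client. The endpoint, bucket, object key, and credentials are placeholders of my own, a sketch of the pattern rather than Arcitecta's documented configuration.

```python
# Sketch: reading tape-archived data through an ordinary S3 call.
# Endpoint, bucket, key, and credentials below are hypothetical.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://mediaflux.example.edu:9090",  # hypothetical S3 endpoint
    aws_access_key_id="RESEARCHER_KEY",
    aws_secret_access_key="RESEARCHER_SECRET",
)

# From the client's point of view this is a plain GET, even if the object
# ultimately lives on a Diamondback tape library behind Mediaflux.
s3.download_file("archive-bucket", "cryoem/run-042/frames.tar", "frames.tar")
```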
Real-time collaboration
At NAB 2025, Arcitecta showed something new: Mediaflux Real-Time.
This is a replicated file system that lets someone write a file continuously at location A while someone at location B reads it as it is being written. Latency ranges from 10 milliseconds to a few hundred milliseconds.
Think live video editing. A camera operator captures footage in one location. An editor on another continent can start working on it immediately. No waiting for uploads to complete.
The system uses Fabric Livewire, Arcitecta's parallel TCP/IP transport layer. It can move data at up to 95% of link speed across continents.
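Arcitecta hasn't published a client API for this, so the following is only a conceptual sketch of the reader's side: plain POSIX tailing of a file on a hypothetical Mediaflux Real-Time mount. It illustrates the read-while-writing access pattern, not the replication protocol itself.

```python
# Illustrative consumer-side sketch only: polling a file that another site
# is still writing, via whatever mount point the replicated file system
# exposes. The path is hypothetical; the tailing logic is generic POSIX.
import time

GROWING_FILE = "/mnt/mediaflux/live/camera-a/take-01.mxf"  # hypothetical mount

def follow(path, poll_interval=0.05):
    """Yield new bytes as they land, instead of waiting for the upload to finish."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(1024 * 1024)
            if chunk:
                yield chunk
            else:
                time.sleep(poll_interval)  # writer at the other site hasn't flushed more yet

for chunk in follow(GROWING_FILE):
    print(f"received {len(chunk)} bytes")  # hand frames to the editing pipeline here
```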
Adding vector search
This year, Arcitecta added vector database support to XODB.
The implementation is straightforward. XODB already stored metadata in what's essentially an XY graph. They added a Z dimension for vectors.
This means you can now search across files, metadata, and vector embeddings in a single query. Dana-Farber Cancer Institute is using this for their AI pipelines. National Film and Sound Archive of Australia uses it with Wasabi AiR for facial recognition and content analysis.
The vector embeddings come from external services. Arcitecta doesn't generate them. They store and index them. This lets you use whatever embedding model makes sense for your data.
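The briefing didn't include XODB's query syntax, so here's the combined metadata-plus-vector lookup expressed in plain NumPy instead. The records, field names, and embedding values are invented for illustration; the point is the single-pass filter-and-rank that Mediaflux performs inside the database.

```python
# Conceptual sketch of a combined metadata + vector query, using NumPy
# rather than XODB's actual query language. All data below is made up.
import numpy as np

records = [
    {"path": "/archive/clip-001.mov", "format": "video", "vec": np.array([0.12, 0.80, 0.33])},
    {"path": "/archive/clip-002.mov", "format": "video", "vec": np.array([0.90, 0.10, 0.05])},
    {"path": "/archive/notes-001.pdf", "format": "doc",   "vec": np.array([0.11, 0.79, 0.30])},
]

query_vec = np.array([0.10, 0.82, 0.31])  # embedding produced by an external model

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Metadata filter and vector ranking in a single pass -- the effect the
# article describes, even though XODB does this inside the database.
hits = sorted(
    (r for r in records if r["format"] == "video"),
    key=lambda r: cosine(r["vec"], query_vec),
    reverse=True,
)
print(hits[0]["path"])  # /archive/clip-001.mov
```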
Python module on the way
Deployment has been Mediaflux's weak point. The system is powerful but complex to set up.
A Python module is coming in the next few months. This should make integration easier for teams already working in Python.
They're also working on better deployment tools. The goal is to reduce the weeks-to-months timeline for getting a system fully operational.
No capacity-based pricing
Arcitecta charges by concurrent users, not data capacity.
This is unusual. Most storage vendors charge by the terabyte. Arcitecta wants you to store as much data as you need. Their revenue model doesn't penalize you for data growth.
For research institutions generating petabytes of data from new instruments, this pricing model matters.
The technical differentiators
Three things set Mediaflux apart technically:
First, it's in the data path. It's not just a management layer on top of storage. It controls data movement, tiering, and access.
Second, everything runs through XODB. This means you can track every operation—who created what file, when, from where, and who accessed it. Princeton exports 70 million audit events per month; a quick calculation below puts that figure in perspective.
Third, the system is storage agnostic. You can add new storage at any time without migration. Applications keep working. Users see the same namespace.
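Here's that audit-rate arithmetic, a simple monthly average of my own rather than anything Princeton reported directly.

```python
# Rough sense of scale for the audit figure: 70 million events per month
# works out to roughly 27 events per second, sustained (simple average).
events_per_month = 70_000_000
seconds_per_month = 30 * 24 * 3600
print(f"{events_per_month / seconds_per_month:.0f} events/second on average")  # ~27
```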
Where this matters
Arcitecta's customers are mainly research institutions and archives. Princeton, MIT, Dana-Farber Cancer Institute, Technische Universität Dresden, Imperial War Museum, National Film and Sound Archive of Australia.
These organizations generate massive amounts of unstructured data. Cryo-EM microscopes. Genomics machines. Film archives. War records.
Traditional storage systems can't keep up. And vendor lock-in becomes a real problem when you're managing hundreds of petabytes.
What's next
Beyond the Python module, Arcitecta is expanding vector database capabilities and building more deployment tools.
They're also starting to talk about HPC workloads. Not competing with purpose-built parallel file systems, but handling a broader range of high-performance computing work.
The company has been quietly building this platform for over two decades. Now they're starting to tell people about it.