How Arcitecta Built a Data Platform That Scales to Trillions of Files
Most data management platforms break down at petabyte scale. Arcitecta's Mediaflux doesn't.
The Australian company has been building data infrastructure since 1998. Their current claim sounds impossible: a single customer is managing over one trillion files in a single namespace. Not across multiple systems. One unified view.
I met with the Arcitecta team during the 64th IT Press Tour in New York to understand how they built this.
Built from scratch
Arcitecta writes everything themselves. Their XODB database. Their protocols. Their file systems. All first-principles engineering.
XODB is an XML-encoded object database that sits at the core of Mediaflux. It was created in 2010 and supports objects, geospatial data, time series, and now vector embeddings.
This matters because most platforms keep their metadata catalog separate from the underlying file system index. XODB doesn't. Everything lives in one database, which makes finding relationships between data much faster.
The database footprint is minimal. For one deployment managing hundreds of billions of files and heading toward trillions, they achieved an average of 75 bytes per node. That's two orders of magnitude denser than typical file systems.
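To put that figure in perspective, here's a rough back-of-envelope calculation. The per-node figure and the one-trillion-file count come from the claims above; the "typical" comparison value is simply the two-orders-of-magnitude gap worked backwards, not a measured number from Arcitecta.

```python
# Back-of-envelope check on the density figure. Only the 75-bytes-per-node
# average and the trillion-file scale are reported; the "typical" value is
# an assumed 100x comparison point.
files = 1_000_000_000_000      # one trillion files
mediaflux_bytes = 75           # reported average metadata per node
typical_bytes = 7_500          # assumed ~100x larger, for comparison only

print(f"Mediaflux metadata footprint: {files * mediaflux_bytes / 1e12:.0f} TB")
print(f"Typical footprint at 100x:    {files * typical_bytes / 1e12:,.0f} TB")
# Mediaflux metadata footprint: 75 TB
# Typical footprint at 100x:    7,500 TB
```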
Protocol agnostic by design
Mediaflux supports NFS, SMB, S3, and SFTP. But here's what makes it different: they wrote all these protocols themselves.
Why does this matter? Because you're not locked into any single vendor's implementation. Princeton University uses Mediaflux to manage 200 petabytes of research data across Dell PowerScale, IBM Spectrum Scale, Dell ECS, IBM Cloud Object Storage, and IBM Diamondback tape libraries. All accessible through the same interface.
The IBM Diamondback integration is particularly interesting. IBM's tape system didn't have native S3 support. Arcitecta built it. Now, researchers at Princeton can access tape-archived data through standard S3 calls.
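To make that concrete, here's roughly what the researcher-facing side of that access pattern looks like with the standard boto3 S3 client. The endpoint, bucket, object key, and credentials are placeholders of my own, a sketch of the pattern rather than Arcitecta's documented configuration.

```python
# Sketch: reading tape-archived data through an ordinary S3 call.
# Endpoint, bucket, key, and credentials below are hypothetical.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://mediaflux.example.edu:9090",  # hypothetical S3 endpoint
    aws_access_key_id="RESEARCHER_KEY",
    aws_secret_access_key="RESEARCHER_SECRET",
)

# From the client's point of view this is a plain GET, even if the object
# ultimately lives on a Diamondback tape library behind Mediaflux.
s3.download_file("archive-bucket", "cryoem/run-042/frames.tar", "frames.tar")
```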
Real-time collaboration
At NAB 2025, Arcitecta showed something new: Mediaflux Real-Time.
This is a replicated file system that lets someone write a file continuously at location A while someone at location B reads it as it is being written. Latency ranges from 10 milliseconds to a few hundred milliseconds.
Think live video editing. A camera operator captures footage in one location. An editor on another continent can start working on it immediately. No waiting for uploads to complete.
The system uses Fabric Livewire, Arcitecta's parallel TCP/IP transport layer. It can move data at up to 95% of link speed across continents.
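Arcitecta hasn't published a client API for this, so the following is only a conceptual sketch of the reader's side: plain POSIX tailing of a file on a hypothetical Mediaflux Real-Time mount. It illustrates the read-while-writing access pattern, not the replication protocol itself.

```python
# Illustrative consumer-side sketch only: polling a file that another site
# is still writing, via whatever mount point the replicated file system
# exposes. The path is hypothetical; the tailing logic is generic POSIX.
import time

GROWING_FILE = "/mnt/mediaflux/live/camera-a/take-01.mxf"  # hypothetical mount

def follow(path, poll_interval=0.05):
    """Yield new bytes as they land, instead of waiting for the upload to finish."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(1024 * 1024)
            if chunk:
                yield chunk
            else:
                time.sleep(poll_interval)  # writer at the other site hasn't flushed more yet

for chunk in follow(GROWING_FILE):
    print(f"received {len(chunk)} bytes")  # hand frames to the editing pipeline here
```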
Adding vector search
This year, Arcitecta added vector database support to XODB.
The implementation is straightforward. XODB already stored metadata in what's essentially an XY graph. They added a Z dimension for vectors.
This means you can now search across files, metadata, and vector embeddings in a single query. Dana-Farber Cancer Institute is using this for their AI pipelines. National Film and Sound Archive of Australia uses it with Wasabi AiR for facial recognition and content analysis.
The vector embeddings come from external services. Arcitecta doesn't generate them. They store and index them. This lets you use whatever embedding model makes sense for your data.
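The briefing didn't include XODB's query syntax, so here's the combined metadata-plus-vector lookup expressed in plain NumPy instead. The records, field names, and embedding values are invented for illustration; the point is the single-pass filter-and-rank that Mediaflux performs inside the database.

```python
# Conceptual sketch of a combined metadata + vector query, using NumPy
# rather than XODB's actual query language. All data below is made up.
import numpy as np

records = [
    {"path": "/archive/clip-001.mov", "format": "video", "vec": np.array([0.12, 0.80, 0.33])},
    {"path": "/archive/clip-002.mov", "format": "video", "vec": np.array([0.90, 0.10, 0.05])},
    {"path": "/archive/notes-001.pdf", "format": "doc",   "vec": np.array([0.11, 0.79, 0.30])},
]

query_vec = np.array([0.10, 0.82, 0.31])  # embedding produced by an external model

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Metadata filter and vector ranking in a single pass -- the effect the
# article describes, even though XODB does this inside the database.
hits = sorted(
    (r for r in records if r["format"] == "video"),
    key=lambda r: cosine(r["vec"], query_vec),
    reverse=True,
)
print(hits[0]["path"])  # /archive/clip-001.mov
```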
Python module on the way
Deployment has been Mediaflux's weak point. The system is powerful but complex to set up.
A Python module is coming in the next few months. This should make integration easier for teams already working in Python.
They're also working on better deployment tools. The goal is to reduce the weeks-to-months timeline for getting a system fully operational.
No capacity-based pricing
Arcitecta charges by concurrent users, not data capacity.
This is unusual. Most storage vendors charge by the terabyte. Arcitecta wants you to store as much data as you need. Their revenue model doesn't penalize you for data growth.
For research institutions generating petabytes of data from new instruments, this pricing model matters.
The technical differentiators
Three things set Mediaflux apart technically:
First, it's in the data path. It's not just a management layer on top of storage. It controls data movement, tiering, and access.
Second, everything runs through XODB. This means you can track every operation—who created what file, when, from where, and who accessed it. Princeton exports 70 million audit events per month; a quick calculation below puts that figure in perspective.
Third, the system is storage agnostic. You can add new storage at any time without migration. Applications keep working. Users see the same namespace.
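Here's that audit-rate arithmetic, a simple monthly average of my own rather than anything Princeton reported directly.

```python
# Rough sense of scale for the audit figure: 70 million events per month
# works out to roughly 27 events per second, sustained (simple average).
events_per_month = 70_000_000
seconds_per_month = 30 * 24 * 3600
print(f"{events_per_month / seconds_per_month:.0f} events/second on average")  # ~27
```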
Where this matters
Arcitecta's customers are mainly research institutions and archives. Princeton, MIT, Dana-Farber Cancer Institute, Technische Universität Dresden, Imperial War Museum, National Film and Sound Archive of Australia.
These organizations generate massive amounts of unstructured data. Cryo-EM microscopes. Genomics machines. Film archives. War records.
Traditional storage systems can't keep up. And vendor lock-in becomes a real problem when you're managing hundreds of petabytes.
What's next
Beyond the Python module, Arcitecta is expanding vector database capabilities and building more deployment tools.
They're also starting to talk about HPC workloads. Not competing with purpose-built parallel file systems, but handling a broader range of high-performance computing work.
The company has been quietly building this platform for over two decades. Now they're starting to tell people about it.