Volumez: Reinventing Cloud Infrastructure for AI/ML Workloads
The increasing complexity of AI/ML workloads is exposing critical inefficiencies in how cloud infrastructure is provisioned and managed. A Silicon Valley startup called Volumez is tackling this problem with a novel approach that could revolutionize how engineers architect their AI infrastructure stacks.
The Infrastructure Imbalance Problem
Traditional cloud infrastructure for AI workloads typically suffers from several key inefficiencies:
- I/O bottlenecks: GPUs often sit idle waiting for data
- Overprovisioning: Engineers compensate for bottlenecks by purchasing excess capacity
- Configuration complexity: the space of instance, storage, and network options is too large to tune optimally by hand
- Operational overhead: Data scientists get dragged into infrastructure management
- Cost inefficiency: Cloud bills skyrocket without corresponding performance gains
According to Volumez's internal benchmarks, these inefficiencies can push GPU utilization rates below 80%, and in the worst cases low enough to effectively double the cost of model training while extending development timelines.
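To see why utilization dominates cost, consider the effective price of a useful GPU-hour. Here is a minimal arithmetic sketch; the hourly rate is an assumed example figure, not a quoted price:

```python
# Illustrative arithmetic only: idle GPU time is billed the same as busy time,
# so the effective cost of useful work scales inversely with utilization.
hourly_rate = 32.77  # assumed on-demand rate for a multi-GPU instance, $/hour

for utilization in (0.95, 0.80, 0.50):
    effective = hourly_rate / utilization
    print(f"{utilization:.0%} utilization -> ${effective:.2f} per useful GPU-hour")
```

At 50% utilization the effective cost is exactly double the list price, which is where the "doubling" figure comes from.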
A New Architecture: Data Infrastructure as a Service
Volumez has developed what it calls "Data Infrastructure as a Service" (DIaaS), which fundamentally rethinks how cloud resources are configured for AI workloads.
The key innovation is "cloud awareness": deep profiling of cloud provider capabilities to create balanced infrastructure that precisely matches workload requirements, without proprietary components in the data path.
```bash
# Traditional approach - manual, imperative configuration

# Configure storage volumes (create-volume requires an availability zone)
aws ec2 create-volume --volume-type io2 --iops 50000 --size 4000 \
    --availability-zone us-west-2a

# Configure instance settings
aws ec2 modify-instance-attribute --instance-id i-1234567890abcdef0 --ebs-optimized
```

```python
# Volumez approach - declarative configuration
import volumez as vlz

# Define the desired outcomes as a policy
policy = {
    "name": "training-infrastructure",
    "performance": {
        "iops": 1000000,
        "latency_usec": 300,
    },
    "encryption": True,
    "resilience": {
        "zones": 1,
    },
}

# Create infrastructure that satisfies the policy
vlz.create_infrastructure(policy)
```
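The contrast is deliberate: the imperative AWS commands encode mechanisms (volume types, instance attributes), while the declarative policy states outcomes (IOPS, latency, resilience) and leaves device selection and tuning to the platform.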
Technical Implementation Details
The implementation relies on three key components:
1. Cloud Profiling Engine: Conducts deep analysis of cloud provider components, including:
- Physical server topology
- Network fabric characteristics
- Storage media specifications
- Cost structures
2. Configurator Agent: A lightweight user-space process that:
- Implements optimal Linux configurations
- Adjusts queue depths
- Configures multipathing
- Sets up resilience schemes
3. Declarative API Layer: Enables infrastructure definition through:
- PyTorch library extensions
- Terraform/Bicep modules
- REST APIs
- Kubernetes operators
The critical insight is that by precisely understanding cloud components and their interactions, a SaaS system can automatically create optimal Linux configurations without introducing proprietary storage controllers or other middleware.
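As a concrete illustration of the kind of tuning the Configurator Agent automates, the block-layer settings it adjusts live in standard sysfs paths. The sketch below is not Volumez's actual code; it assumes NVMe devices, root privileges, and hand-picked values that a real agent would instead derive from the cloud profile:

```python
from pathlib import Path

def tune_block_device(device: str, scheduler: str = "none", nr_requests: int = 1023) -> None:
    """Apply block-layer tuning of the kind a configurator agent automates.

    Illustrative only; requires root, since it writes to sysfs.
    """
    queue = Path("/sys/block") / device / "queue"
    # I/O scheduler: "none" is typical for fast NVMe media
    (queue / "scheduler").write_text(scheduler)
    # Queue depth: deeper queues keep fast media busy under parallel I/O
    (queue / "nr_requests").write_text(str(nr_requests))

if __name__ == "__main__":
    tune_block_device("nvme0n1")
```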
Performance at Scale
In MLCommons MLPerf Storage 1.0 benchmarks, Volumez demonstrated:
- 1.14 TB/sec throughput
- 9.9M IOPS
- 92% GPU utilization
- 411 simulated GPUs
These results significantly outperformed traditional storage approaches from vendors like DDN, NetApp, and VAST Data, while showing cost reductions of 27-70% for storage and 50-92% for compute compared to standard AWS configurations.
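For context, 1.14 TB/sec spread across 411 simulated GPUs works out to roughly 2.8 GB/sec of sustained bandwidth per GPU.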
Implementation Patterns
The platform offers two main deployment patterns that engineers should consider:
1. Hyperconverged Pattern
```
┌─────────────────┐  ┌─────────────────┐
│  GPU Instance   │  │  GPU Instance   │
│ ┌─────┐ ┌─────┐ │  │ ┌─────┐ ┌─────┐ │
│ │ GPU │ │ GPU │ │  │ │ GPU │ │ GPU │ │
│ └─────┘ └─────┘ │  │ └─────┘ └─────┘ │
│ ┌─────────────┐ │  │ ┌─────────────┐ │
│ │ Local SSDs  │ │  │ │ Local SSDs  │ │
│ └─────────────┘ │  │ └─────────────┘ │
└─────────────────┘  └─────────────────┘
```
- Uses local SSDs on GPU servers
- Best for datasets under 100TB
- Simpler deployment model
- Good for persistent clusters
2. Flex Pattern
```
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│ GPU Cluster │   │ GPU Cluster │   │ GPU Cluster │
│ ┌─────────┐ │   │ ┌─────────┐ │   │ ┌─────────┐ │
│ │ GPU GPU │ │   │ │ GPU GPU │ │   │ │ GPU GPU │ │
│ └─────────┘ │   │ └─────────┘ │   │ └─────────┘ │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │
       └─────────────────┼─────────────────┘
                         │
                     ┌───▼───┐
┌──────────────┐     │Network│     ┌──────────────┐
│ Storage Node │◄────┤       ├────►│ Storage Node │
│ ┌──────────┐ │     └───────┘     │ ┌──────────┐ │
│ │ SSD SSD  │ │                   │ │ SSD SSD  │ │
│ │ SSD SSD  │ │                   │ │ SSD SSD  │ │
│ └──────────┘ │                   │ └──────────┘ │
└──────────────┘                   └──────────────┘
```
- Scales storage independently from compute
- Supports datasets >100TB
- Better for dynamic clusters
- Higher resilience options
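The selection criteria above reduce to a simple rule of thumb. The helper below is purely illustrative, not part of any Volumez API; it just encodes the guidance from the two bullet lists:

```python
def suggest_pattern(dataset_tb: float, persistent_cluster: bool) -> str:
    """Rule of thumb distilled from the guidance above; illustrative only."""
    if dataset_tb > 100 or not persistent_cluster:
        return "flex"            # independent storage scaling, dynamic clusters
    return "hyperconverged"      # local SSDs on GPU servers, simpler deployment

print(suggest_pattern(dataset_tb=40, persistent_cluster=True))    # hyperconverged
print(suggest_pattern(dataset_tb=250, persistent_cluster=False))  # flex
```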
Developer Experience
For developers and data scientists, the most compelling feature is the ability to provision infrastructure directly from Python notebooks:
```python
import volumez as vlz
from torch.utils.data import DataLoader

# Create managed dataset with automated infrastructure
dataset = vlz.datasets.ImageFolder(
    dataset_name="medical-scans",
    mode=vlz.datasets.Mode.Train,
    version=vlz.datasets.Version.Latest,
    credentials=credentials,
    mount="/mnt/dataset",
)

# Infrastructure automatically provisioned and optimized
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Process data normally
for epoch in range(10):
    for i, batch in enumerate(dataloader):
        # train step
        pass

# Infrastructure automatically torn down when complete
dataset.close()
```
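Since the dataset object owns the infrastructure lifecycle, it is worth wrapping the training loop in try/finally so that close() runs and the underlying resources are released even if a training step raises.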
This reduces or eliminates the need for separate MLOps or infrastructure teams, allowing data scientists to work independently while maintaining optimal performance.
Engineering Considerations
When evaluating this approach for your own AI infrastructure, consider:
1. Workload characteristics
- File sizes and counts
- Data access patterns
- Batch sizes and preprocessing needs
2. Team structure
- Data scientist to MLOps engineer ratio
- Team experience with infrastructure
- Preferred workflow tools
3. Cost structure
- GPU utilization targets
- Infrastructure budget constraints
- Performance requirements
Conclusion
Volumez's approach represents a significant rethinking of how cloud infrastructure should be configured for AI workloads. By leveraging deep cloud awareness and standard Linux components, their platform offers a compelling alternative to traditional storage-centric approaches.
For engineering teams struggling with AI infrastructure bottlenecks, particularly those dealing with large-scale training workloads, this cloud-aware approach could offer substantial performance improvements while reducing both infrastructure costs and operational complexity.
The most impressive aspect may be how Volumez achieves these outcomes without introducing proprietary middleware or storage controllers: the platform simply makes better use of what the cloud already provides through intelligent configuration.