Cloud-aware platform helps achieve 92% GPU utilization while slashing infrastructure costs for AI workloads.


Volumez: Reinventing Cloud Infrastructure for AI/ML Workloads

The increasing complexity of AI/ML workloads is exposing critical inefficiencies in how cloud infrastructure is provisioned and managed. A Silicon Valley startup called Volumez is tackling this problem with a novel approach that could revolutionize how engineers architect their AI infrastructure stacks.

The Infrastructure Imbalance Problem

Traditional cloud infrastructure for AI workloads typically suffers from several key inefficiencies:

  • I/O bottlenecks: GPUs often sit idle waiting for data
  • Overprovisioning: Engineers compensate for bottlenecks by purchasing excess capacity
  • Configuration complexity: Tuning cloud configuration optimally by hand is impractical at scale
  • Operational overhead: Data scientists get dragged into infrastructure management
  • Cost inefficiency: Cloud bills skyrocket without corresponding performance gains

According to Volumez's internal benchmarks, these inefficiencies can push GPU utilization well below 80%, inflating the effective cost of model training (running at roughly half the achievable utilization doubles the spend for the same work) while extending development timelines.
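The arithmetic behind that cost penalty is simple: you pay for billed GPU-hours, but useful work only happens while the GPUs are busy. Here is a minimal sketch of that relationship, using illustrative prices and utilization figures rather than Volumez's numbers:

# Simple cost model: spend scales inversely with GPU utilization.
# All numbers below are illustrative assumptions, not vendor figures.

gpu_hourly_rate = 32.0            # assumed price of an 8-GPU instance, $/hour
useful_gpu_hours_needed = 1000    # GPU-hours the training job actually requires

def effective_cost(utilization):
    """Total spend when GPUs are busy only `utilization` fraction of the time."""
    billed_hours = useful_gpu_hours_needed / utilization
    return billed_hours * gpu_hourly_rate

for u in (0.46, 0.80, 0.92):
    print(f"{u:.0%} utilization -> ${effective_cost(u):,.0f}")

# Running at ~46% utilization costs roughly twice as much as running at 92%
# for exactly the same amount of useful training work.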

A New Architecture: Data Infrastructure as a Service

Volumez has developed what they call "Data Infrastructure as a Service" (DIaaS), which fundamentally rethinks how cloud resources are configured for AI workloads.

The key innovation is "cloud awareness": deep profiling of each cloud provider's capabilities to build balanced infrastructure that precisely matches workload requirements, with no proprietary components in the data path.

# Traditional approach - manual configuration
# Configure storage volumes
aws ec2 create-volume --volume-type io2 --iops 50000 --size 4000 --availability-zone us-west-2a --region us-west-2
# Configure network settings
aws ec2 modify-instance-attribute --instance-id i-1234567890abcdef0 --ebs-optimized

# Volumez approach - declarative configuration
import volumez as vlz

# Define policy
policy = {
  "name": "training-infrastructure",
  "performance": {
    "iops": 1000000,
    "latency_usec": 300
  },
  "encryption": True,
  "resilience": {
    "zones": 1
  }
}

# Create infrastructure
vlz.create_infrastructure(policy)

Technical Implementation Details

The implementation relies on three key components:

1. Cloud Profiling Engine: Conducts deep analysis of cloud provider components, including:

  • Physical server topology
  • Network fabric characteristics
  • Storage media specifications
  • Cost structures

2. Configurator Agent: A lightweight user-space process that:

  • Implements optimal Linux configurations
  • Adjusts queue depths
  • Configures multipathing
  • Sets up resilience schemes

3. Declarative API Layer: Enables infrastructure definition through:

  • PyTorch library extensions
  • Terraform/Bicep modules
  • REST APIs
  • Kubernetes operators

The critical insight is that by precisely understanding cloud components and their interactions, a SaaS system can automatically create optimal Linux configurations without introducing proprietary storage controllers or other middleware.
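Volumez has not published the Configurator Agent's internals, but the kind of standard-Linux tuning described above (queue depths, schedulers, read-ahead; multipathing and resilience would similarly rely on dm-multipath and md) can be pictured as ordinary sysfs writes. The sketch below is a hypothetical illustration of that idea, not Volumez code; device names and values are placeholders:

# Hypothetical sketch of a user-space configurator applying block-layer tuning
# through standard Linux interfaces (sysfs) - no proprietary driver in the
# data path. Device names and values are placeholders, not Volumez settings.

from pathlib import Path

TUNING = {
    "queue/nr_requests": "1023",    # deeper request queue for high-IOPS NVMe
    "queue/scheduler": "none",      # skip the I/O scheduler on fast SSDs
    "queue/read_ahead_kb": "4096",  # larger read-ahead for sequential training reads
}

def tune_block_device(device):
    """Write tuning values under /sys/block/<device>/ if the knobs exist."""
    base = Path("/sys/block") / device
    for knob, value in TUNING.items():
        path = base / knob
        if path.exists():
            path.write_text(value)      # requires root
            print(f"set {path} = {value}")

if __name__ == "__main__":
    # Example: tune the local NVMe drives on a GPU instance.
    for dev in ("nvme0n1", "nvme1n1"):
        tune_block_device(dev)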

Performance at Scale

In MLCommons MLPerf Storage 1.0 benchmarks, Volumez demonstrated:

  • 1.14 TB/sec throughput
  • 9.9M IOPS
  • 92% GPU utilization
  • 411 simulated GPUs

These results significantly outperformed traditional storage approaches from vendors like DDN, NetApp, and VAST Data, while showing cost reductions of 27-70% for storage and 50-92% for compute compared to standard AWS configurations.

Implementation Patterns

The platform offers two main deployment patterns that engineers should consider (a simple rule of thumb for choosing between them is sketched after the two pattern descriptions):

1. Hyperconverged Pattern
┌─────────────────┐  ┌─────────────────┐
│  GPU Instance   │  │  GPU Instance   │
│ ┌─────┐ ┌─────┐ │  │ ┌─────┐ ┌─────┐ │
│ │ GPU │ │ GPU │ │  │ │ GPU │ │ GPU │ │
│ └─────┘ └─────┘ │  │ └─────┘ └─────┘ │
│    │       │    │  │    │       │    │
│ ┌─────────────┐ │  │ ┌─────────────┐ │
│ │ Local SSDs  │ │  │ │ Local SSDs  │ │
│ └─────────────┘ │  │ └─────────────┘ │
└─────────────────┘  └─────────────────┘

  • Uses local SSDs on GPU servers
  • Best for datasets under 100TB
  • Simpler deployment model
  • Good for persistent clusters

2. Flex Pattern
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│ GPU Cluster │  │ GPU Cluster │  │ GPU Cluster │
│ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │
│ │ GPU GPU │ │  │ │ GPU GPU │ │  │ │ GPU GPU │ │
│ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
       │                │                │
       └────────────────┼────────────────┘
                        │
                    ┌───▼───┐
┌──────────────┐    │Network│    ┌──────────────┐
│ Storage Node │◄───┤       ├───►│ Storage Node │
│ ┌──────────┐ │    └───────┘    │ ┌──────────┐ │
│ │ SSD  SSD │ │                 │ │ SSD  SSD │ │
│ │ SSD  SSD │ │                 │ │ SSD  SSD │ │
│ └──────────┘ │                 │ └──────────┘ │
└──────────────┘                 └──────────────┘

  • Scales storage independently from compute
  • Supports datasets >100TB
  • Better for dynamic clusters
  • Higher resilience options
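As noted above, the trade-off between the two patterns comes down mainly to dataset size and cluster lifetime. The helper below simply restates the guidance from those bullets as code; the 100TB threshold comes from the article, and this is not an official Volumez sizing tool:

# Hypothetical decision rule that restates the pattern guidance above.
# The 100 TB threshold and the persistent/dynamic distinction come from the
# bullets above; this is not an official Volumez sizing tool.

def choose_pattern(dataset_tb, persistent_cluster):
    """Suggest a deployment pattern for a training environment."""
    if dataset_tb <= 100 and persistent_cluster:
        # Local SSDs on the GPU servers keep deployment simple.
        return "hyperconverged"
    # Larger datasets, or clusters that scale up and down, benefit from
    # storage that scales independently of compute.
    return "flex"

print(choose_pattern(dataset_tb=40, persistent_cluster=True))    # hyperconverged
print(choose_pattern(dataset_tb=250, persistent_cluster=False))  # flex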

Developer Experience

For developers and data scientists, the most compelling feature is the ability to provision infrastructure directly from Python notebooks:

import volumez as vlz
from torch.utils.data import DataLoader

# Create managed dataset with automated infrastructure
# ("credentials" is assumed to be defined earlier with your account credentials)
dataset = vlz.datasets.ImageFolder(
    dataset_name="medical-scans",
    mode=vlz.datasets.Mode.Train,
    version=vlz.datasets.Version.Latest,
    credentials=credentials,
    mount="/mnt/dataset"
)

# Infrastructure automatically provisioned and optimized
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Process data normally
for epoch in range(10):
    for i, batch in enumerate(dataloader):
        # train step
        pass

# Infrastructure automatically torn down when complete
dataset.close()

This reduces the need for hand-offs to separate MLOps or infrastructure teams, allowing data scientists to work independently while maintaining optimal performance.

Engineering Considerations

When evaluating this approach for your own AI infrastructure, consider the following (a quick throughput sanity check is sketched after the list):

1. Workload characteristics

  • File sizes and counts
  • Data access patterns
  • Batch sizes and preprocessing needs

2. Team structure

  • Data scientist to MLOps engineer ratio
  • Team experience with infrastructure
  • Preferred workflow tools

3. Cost structure

  • GPU utilization targets
  • Infrastructure budget constraints
  • Performance requirements
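For the workload-characteristics question in particular, a back-of-the-envelope estimate of the read throughput needed to keep the GPUs fed is often enough to reveal whether storage will be the bottleneck. The figures below are placeholder assumptions purely to show the arithmetic:

# Back-of-the-envelope check: can storage feed the GPUs fast enough?
# All inputs are placeholder assumptions; substitute your own measurements.

def required_read_throughput_gbs(samples_per_sec_per_gpu, bytes_per_sample, num_gpus):
    """Sustained read throughput (GB/s) needed to keep every GPU fed with data."""
    return samples_per_sec_per_gpu * bytes_per_sample * num_gpus / 1e9

# Example: ~0.5 MB images, 400 samples/s per GPU, 64 GPUs
needed = required_read_throughput_gbs(
    samples_per_sec_per_gpu=400,
    bytes_per_sample=0.5e6,
    num_gpus=64,
)
print(f"~{needed:.1f} GB/s sustained reads required")  # ~12.8 GB/s

# If the storage tier cannot sustain this rate, GPU utilization (and cost
# efficiency) drops no matter how the compute tier is sized.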

Conclusion

Volumez's approach represents a significant rethinking of how cloud infrastructure should be configured for AI workloads. By leveraging deep cloud awareness and standard Linux components, their platform offers a compelling alternative to traditional storage-centric approaches.

For engineering teams struggling with AI infrastructure bottlenecks, particularly those dealing with large-scale training workloads, this cloud-aware approach could offer substantial performance improvements while reducing both infrastructure costs and operational complexity.

The most impressive aspect may be how Volumez manages to achieve these outcomes without introducing proprietary middleware or storage controllers - instead, they're simply making better use of what the cloud already provides through intelligent configuration.


Another nice one Tom...
That Python setup is sick! But how does Volumez handle things if you’re using multiple clouds or some on-prem stuff? Just curious as always :-)

Great question about multi-cloud and hybrid scenarios.

From what Volumez presented, their cloud-aware approach works across multiple cloud providers (AWS, Azure, GCP, and Oracle were shown in their presentation). Their platform catalogs and profiles the unique characteristics of each cloud's infrastructure components to create optimized configurations regardless of which provider you're using.

For hybrid cloud/on-prem scenarios, the concept should work similarly as long as the on-prem environment uses standard Linux components. Their architecture is explicitly designed to avoid proprietary elements in the data path, relying instead on optimizing standard Linux configurations.

The real magic is in their "cloud awareness" catalog that understands the unique characteristics and constraints of different infrastructure environments. I imagine they'd need to extend this profiling to your specific on-prem hardware, but the underlying approach of creating balanced configurations should translate.

I'll follow up with them to get more specifics about hybrid deployments and whether they have any special considerations for bridging between cloud and on-prem resources in the same AI pipeline. I'm curious how their PyTorch integration would handle datasets split between environments.
