Visualizing Inference: Training a Medical Imaging U-Net on a 2014 CPU and a GTX 1060


While the mainstream tech world burns gigawatts of energy scaling massive models on H100 clusters, I decided to pull things down to earth and focus on low-level optimization for a real-world, high-impact domain: automated medical image segmentation.

The objective was to train a custom light U-Net architecture to segment brain tumors using the international BraTS (Brain Tumor Segmentation) dataset.

The Stack & Hardware Control

  • OS: Arch Linux

  • Runtime & Framework: Pure PyTorch

  • Package Management: uv (because legacy tools are too slow for proper systems engineering)

  • Hardware Constraints: A legacy 8-core AMD FX-8370E CPU (Vishera architecture, 95W TDP) paired with a consumer-grade GTX 1060 (6GB VRAM) completely devoid of modern Tensor Cores.

No AWS instances, no managed cloud notebooks. Just raw local hardware control and silicon optimization.


Pipeline & Thermal Optimization

Processing volumetric 3D medical data (stored in multi-channel HDF5 files) puts sustained stress on the system bus and the motherboard VRMs. Under full load, the FX-8370E operated at the edge of its power envelope, drawing 89.63 W of its 94.84 W power limit. A custom cooling configuration kept core temperatures stable below 51°C.

To survive hardware instability during long runs without losing state, a lightweight polling routine handled checkpointing. Every 5–10 seconds it checked telemetry against preset limits, wrote the model state to disk, and kept a rolling fallback to the best stable weights in case an anomaly was detected.
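The polling routine described above could look roughly like the sketch below. It is an illustration under assumptions, not the code from the repository: the class name `RollingCheckpointer`, the file names, the `healthy` flag, and the use of `pickle` are all hypothetical stand-ins. In a PyTorch training loop the state would more likely be `model.state_dict()` written with `torch.save`.

```python
import math
import os
import pickle
import tempfile
import time


class RollingCheckpointer:
    """Hypothetical helper: saves state at most once per interval and
    keeps a separate copy of the best state seen so far for rollback."""

    def __init__(self, directory, interval_s=5.0, clock=time.monotonic):
        self.directory = directory
        self.interval_s = interval_s
        self.clock = clock  # injectable so the timing can be tested
        self.last_poll = None
        self.best_loss = float("inf")
        os.makedirs(directory, exist_ok=True)

    def _path(self, name):
        return os.path.join(self.directory, name)

    def _atomic_dump(self, state, name):
        # Write to a temporary file, then rename it into place, so a crash
        # in the middle of a write never leaves a truncated checkpoint.
        fd, tmp = tempfile.mkstemp(dir=self.directory)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, self._path(name))

    def poll(self, state, loss, healthy=True):
        """Call once per training step. Returns 'skipped', 'saved' or 'rollback'."""
        now = self.clock()
        if self.last_poll is not None and now - self.last_poll < self.interval_s:
            return "skipped"
        self.last_poll = now
        # `healthy` stands in for whatever telemetry check is used
        # (temperature, power draw, ...); a NaN loss also counts as an anomaly.
        if not healthy or math.isnan(loss):
            return "rollback"
        self._atomic_dump(state, "latest.pkl")
        if loss < self.best_loss:
            self.best_loss = loss
            self._atomic_dump(state, "best.pkl")
        return "saved"

    def load_best(self):
        """Return the best state written so far."""
        with open(self._path("best.pkl"), "rb") as f:
            return pickle.load(f)
```

On a `"rollback"` result the training loop would call `load_best()` and resume from those weights. Writing through a temporary file and `os.replace` is a design choice of this sketch: it keeps the previous checkpoint intact if the machine goes down in the middle of a write.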


Training Performance Analytics (10 Epochs)

Each epoch took 24 minutes and 21 seconds, at a steady throughput of 8.81 iterations per second.

Epoch 10/10: 100%|████████████| 12869/12869 [24:21<00:00, 8.81it/s, loss=0.00272]

[INFO] Epoch 10 | Train Loss: 0.0017 | Val Loss: 0.0017

[INFO] Saved state: ./checkpoints/unet_brats_epoch_10.pth
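The figures in the progress bar are consistent with each other, which a few lines of arithmetic can confirm. The helper name `epoch_duration` is made up for this check. It assumes the 8.81 it/s rate holds for the whole epoch.

```python
def epoch_duration(iterations, its_per_second):
    """Return (minutes, seconds) for one epoch at a steady iteration rate."""
    total_seconds = round(iterations / its_per_second)
    return divmod(total_seconds, 60)


minutes, seconds = epoch_duration(12869, 8.81)
print(f"{minutes}:{seconds:02d}")  # prints 24:21
```

At that rate, ten epochs come to roughly four hours of training time, not counting validation passes or checkpoint writes.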

The loss convergence curves demonstrated high stability, driven by a Dice Loss objective optimized for severe class imbalance (where the background voxels vastly outnumber the target tumor regions):

  • Epoch 1: Train Loss: 0.0457 | Val Loss: 0.0038

  • Epoch 10: Train Loss: 0.0017 | Val Loss: 0.0017

The near-identical final training and validation losses suggest the model is not overfitting the validation split. Performance on unseen test distributions still has to be confirmed on held-out data, but the model is ready to run inference.
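To show why a Dice objective suits this kind of class imbalance, here is a minimal sketch of the standard soft Dice loss. The post does not state the exact formulation used (smoothing term, per-class averaging, or any combination with cross-entropy), so treat the function `soft_dice_loss` and its `eps` parameter as assumptions. It is written in plain Python over flattened voxel lists to stay self-contained; in PyTorch the same sums would be tensor operations.

```python
def soft_dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss over flattened voxels: 1 - 2*sum(p*t) / (sum(p) + sum(t)).

    probs   -- predicted foreground probabilities in [0, 1]
    targets -- ground-truth labels, 1.0 for tumor and 0.0 for background
    eps     -- small smoothing term that avoids division by zero
    """
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


# 1000 voxels, of which only 10 belong to the tumor.
targets = [1.0] * 10 + [0.0] * 990

# Predicting "background everywhere" is 99% accurate per voxel,
# yet the Dice loss stays close to its maximum of 1.0.
all_background = [0.0] * 1000
print(soft_dice_loss(all_background, targets))

# A perfect prediction drives the loss to 0.
print(soft_dice_loss(targets, targets))
```

Because the loss depends only on the overlap with the tumor region, the large number of correctly classified background voxels cannot mask a missed tumor.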


Ground Truth vs. U-Net Prediction: Pixel-Level Conformance

The visual validation of the inference pipeline highlights the precision achieved within just 10 training epochs.

When the labor-intensive manual annotations produced by expert radiologists (Ground Truth) are compared against the immediate output of the AI pipeline (U-Net Prediction), the geometric alignment is striking.

The network tracked the complex, irregular boundaries of the tumor core closely. It preserved sharp edge features and small satellite regions while suppressing false positives in healthy brain tissue. For the most descriptive slices, the per-slice Dice coefficient approaches ~0.95.


Next Steps & Deployment Architecture

The saved model weights occupy just 23 MB. This small footprint removes the need for expensive server-side hardware at deployment. The next phase is to export the compute graph to ONNX and run it through ONNX Runtime and OpenVINO for real-time, low-latency CPU inference on a standard workstation inside a clinical environment.
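As a rough sense of scale, the 23 MB figure can be turned into an approximate parameter count. This is back-of-the-envelope arithmetic, not a measurement: it assumes the file holds only float32 weights (4 bytes each), that "MB" means 10^6 bytes, and that container overhead is negligible. The function name `estimate_param_count` is made up for the illustration.

```python
def estimate_param_count(file_size_mb, bytes_per_param=4):
    """Approximate parameter count implied by a weights file size.

    Assumes decimal megabytes and a file that stores nothing but weights.
    """
    return int(file_size_mb * 1_000_000 / bytes_per_param)


print(f"{estimate_param_count(23):,}")  # prints 5,750,000
```

Under those assumptions the network has on the order of six million parameters.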

Longer-term plans involve interfacing this lightweight pipeline with real-time fNIRS/EEG data streams and deploying object detection layers for immediate anomaly localization.

The repository is open-source. True systems engineering relies on peer review and transparent codebases.

https://github.com/alexvoste/forgemed-ai

What are your thoughts on optimizing U-Net execution parameters for edge CPU architectures? Let's discuss performance tuning down in the comments.
