NDM-TCP: The 100Gbps Ultra-Low Latency Build

What's New in the Optimized Build (v2.0.0-100g)

The "Ultra Optimized" build of NDM-TCP represents a radical shift from the standard v1.0 logic. While the standard version prioritizes mathematical precision and readability, this 100Gbps target build prioritizes CPU cache locality and interrupt-context efficiency.

This version is designed specifically for high-throughput environments (100GbE/400GbE) where the CPU budget per packet is measured in nanoseconds.

GitHub: hejdiss/lkm-ndm-tcp

Key Optimizations vs v1.0

1. Aggressive Quantization (s8/u8)

  • v1.0: Used s32 for inputs and s16 for weights.

  • 100G Build: Converted the entire neural network pipeline to signed 8-bit integers (s8).

Impact: Inputs shrink from 4 bytes to 1 (a 75% reduction) and weights from 2 bytes to 1 (a 50% reduction), with a matching drop in memory bandwidth. The entire weight matrix now fits in L1 cache, and vector operations can be performed using standard integer registers without complex casting.
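A minimal sketch of what an all-s8 neuron might look like, written as plain userspace C so it can be compiled and tested outside the kernel. The helper names (clamp_s8, quantize_input, neuron_s8) and the shift-based scaling are assumptions made for illustration; they are not taken from the NDM-TCP source.

```c
#include <stdint.h>
#include <stddef.h>

/* Saturate a 32-bit intermediate value into the s8 range [-128, 127]. */
static int8_t clamp_s8(int32_t v)
{
    if (v > INT8_MAX)
        return INT8_MAX;
    if (v < INT8_MIN)
        return INT8_MIN;
    return (int8_t)v;
}

/*
 * Scale a wide input (for example an RTT delta) down to s8.
 * Assumes an arithmetic right shift for negative values, which is what
 * GCC and Clang provide; the scale factor itself is hypothetical.
 */
static int8_t quantize_input(int32_t x, unsigned shift)
{
    return clamp_s8(x >> shift);
}

/*
 * One neuron: s8 weights times s8 inputs, accumulated in 32 bits so the
 * products cannot overflow, then rescaled and saturated back to s8.
 */
static int8_t neuron_s8(const int8_t *w, const int8_t *x, size_t n,
                        unsigned out_shift)
{
    int32_t acc = 0;
    size_t i;

    for (i = 0; i < n; i++)
        acc += (int32_t)w[i] * (int32_t)x[i];

    return clamp_s8(acc >> out_shift);
}
```

Only the accumulator is wider than 8 bits, so the stored weights and inputs stay at one byte each. Saturating instead of wrapping keeps an out-of-range activation at the nearest representable value.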

2. Single-Cache-Line Struct (40 Bytes)

  • v1.0: The ndm_tcp struct was packed to fit ICSK_CA_PRIV_SIZE (64 bytes) but utilized most of it.

  • 100G Build: Compressed to exactly 40 bytes.

Impact: At 40 bytes the state is smaller than a single x86 cache line (64 bytes). Provided the struct does not straddle a line boundary, the CPU loads the entire congestion control context (history, weights, flags) with one cache-line fetch, which avoids additional cache misses on the critical path.
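To show how a 40-byte budget can be enforced at compile time, here is a hypothetical layout. The field names and sizes are assumptions chosen only so that they add up to 40 bytes; the real struct in the repository may look quite different. Inside the kernel, congestion control modules typically guard the size with BUILD_BUG_ON(sizeof(...) > ICSK_CA_PRIV_SIZE); the sketch uses C11 _Static_assert so it also builds in userspace.

```c
#include <stdint.h>
#include <stddef.h>

#define SKETCH_HIST_LEN 16   /* assumed history depth */
#define SKETCH_WEIGHTS  16   /* assumed number of s8 weights */

/* Hypothetical 40-byte state; every field here is illustrative. */
struct ndm_state_sketch {
    uint8_t  hist[SKETCH_HIST_LEN];   /* 16 bytes: quantized history samples */
    int8_t   weights[SKETCH_WEIGHTS]; /* 16 bytes: s8 network weights        */
    uint32_t cached_cwnd;             /*  4 bytes: last computed cwnd        */
    uint8_t  hist_idx;                /*  1 byte : ring-buffer write index   */
    uint8_t  nn_skip_counter;         /*  1 byte : packets since last pass   */
    uint8_t  plasticity;              /*  1 byte : quantized plasticity      */
    uint8_t  flags;                   /*  1 byte : state flags               */
};

/* Fail the build if the layout ever drifts from the 40-byte budget. */
_Static_assert(sizeof(struct ndm_state_sketch) == 40,
               "state must stay at exactly 40 bytes");
_Static_assert(sizeof(struct ndm_state_sketch) <= 64,
               "state must fit in one 64-byte cache line");
```

Placing the byte arrays first and the single 32-bit field on a 4-byte boundary means the compiler inserts no padding, so the declared fields account for all 40 bytes.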

3. Bitwise Entropy Calculation

  • v1.0: Used division and loops to calculate Shannon entropy.

  • 100G Build: Replaces division with bitwise shifts based on range magnitude. The loop is unrolled and operates on u8 history data, allowing the CPU to calculate entropy in fewer than 20 cycles.
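One way a division-free entropy estimate over u8 history could be built is sketched below. It is an illustration under stated assumptions, not the module's actual routine: the 16-sample history, the 4-bin histogram, the lookup table, and the names entropy_q4 and bit_width_u8 are all hypothetical. The bin width is derived from the magnitude of the sample range with a shift, and because the history length is a power of two, the division by N also becomes a shift.

```c
#include <stdint.h>

#define HIST_LEN 16  /* assumed history depth; power of two so /N is a shift */

/* round(16 * c * log2(c)) for c = 0..16, i.e. c*log2(c) in Q4 fixed point. */
static const uint16_t clog2c_q4[HIST_LEN + 1] = {
    0, 0, 32, 76, 128, 186, 248, 314, 384,
    456, 532, 609, 688, 770, 853, 938, 1024
};

/* Number of significant bits in x (0 for x == 0); the kernel's fls() is similar. */
static unsigned bit_width_u8(uint8_t x)
{
    unsigned n = 0;

    while (x) {
        n++;
        x >>= 1;
    }
    return n;
}

/* Shannon entropy of a 4-bin histogram, in Q4 bits: 0 (flat) to 32 (2.0 bits). */
static unsigned entropy_q4(const uint8_t hist[HIST_LEN])
{
    uint8_t lo = hist[0], hi = hist[0];
    uint8_t count[4] = {0, 0, 0, 0};
    unsigned i, bits, shift, sum = 0;

    for (i = 1; i < HIST_LEN; i++) {
        if (hist[i] < lo)
            lo = hist[i];
        if (hist[i] > hi)
            hi = hist[i];
    }

    /* Pick the bin width from the range magnitude: a shift, not a divide. */
    bits = bit_width_u8((uint8_t)(hi - lo));
    shift = bits > 2 ? bits - 2 : 0;

    for (i = 0; i < HIST_LEN; i++)
        count[(uint8_t)(hist[i] - lo) >> shift]++;

    for (i = 0; i < 4; i++)
        sum += clog2c_q4[count[i]];

    /* H = log2(N) - (1/N) * sum(c * log2(c)), with N = 16. */
    return (4u << 4) - (sum >> 4);
}
```

A history of identical samples yields 0, while samples spread evenly across four bins yield 32 (2.0 bits in Q4). All loops have fixed trip counts, so a compiler is free to unroll them.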

5. "Stable State" Neural Bypass

The module now includes an nn_skip_counter. If the network entropy is low (stable) and plasticity is high, the algorithm assumes the network state has not changed enough to warrant a full forward pass. It reuses the previous cwnd calculation for up to 16 packets, which saves the cost of a forward pass on each of those packets during bulk data transfers.
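The bypass logic described above can be sketched roughly as follows. Only nn_skip_counter and the 16-packet limit come from the text; the thresholds, the struct layout, and the stub standing in for the neural network are assumptions made for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

#define NN_SKIP_MAX     16   /* from the text: reuse for up to 16 packets */
#define ENTROPY_LOW     8    /* hypothetical "stable" threshold           */
#define PLASTICITY_HIGH 200  /* hypothetical "high plasticity" threshold  */

struct bypass_sketch {
    uint32_t cached_cwnd;     /* result of the last full forward pass */
    uint8_t  nn_skip_counter; /* packets served from the cache so far */
    bool     has_cached;      /* false until the first forward pass   */
};

/* Counts how often the expensive path runs; used only to observe the sketch. */
static unsigned forward_pass_calls;

/* Stand-in for the real network: any deterministic function will do here. */
static uint32_t nn_forward_stub(uint8_t entropy)
{
    forward_pass_calls++;
    return 10u + entropy;
}

/* Return the cwnd for this packet, skipping the forward pass when stable. */
static uint32_t next_cwnd(struct bypass_sketch *st, uint8_t entropy,
                          uint8_t plasticity)
{
    bool stable = entropy < ENTROPY_LOW && plasticity > PLASTICITY_HIGH;

    if (stable && st->has_cached && st->nn_skip_counter < NN_SKIP_MAX) {
        st->nn_skip_counter++;
        return st->cached_cwnd;
    }

    /* Unstable input or skip budget exhausted: recompute and reset. */
    st->nn_skip_counter = 0;
    st->cached_cwnd = nn_forward_stub(entropy);
    st->has_cached = true;
    return st->cached_cwnd;
}
```

With stable inputs, the first packet triggers a forward pass, the next 16 reuse its result, and the 18th packet triggers a fresh pass. Any unstable sample forces an immediate recomputation and resets the counter.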

Important Disclaimers

This optimized version is a specialized low-latency implementation.

  • Precision: The move to 8-bit quantization reduces the "resolution" of the neural network. While sufficient for TCP congestion control (which is inherently noisy), it effectively trades mathematical purity for raw speed.

  • Performance: You should expect a 50-60% reduction in CPU cycles per packet. Throughput gains will be most noticeable on CPU-bound senders driving 100Gbps links.

Compilation Instructions

The Linux kernel build system expects the source file name to match the module name defined in the Makefile. To compile this ultra-optimized version, you must copy it over the standard source file name.

Step 1: Backup standard version

mv ndm_tcp_lkm.c ndm_tcp_lkm.c.bak

Step 2: Copy optimized source to the standard file name

cp ndm_tcp_optimized_ultra.c ndm_tcp_lkm.c

Step 3: Compile

make

Step 4: Load Module

make enable
