We shipped v4.1.0 with a clear mandate: eliminate heap allocations from every critical path,
harden the syscall layer, and make the build engine deterministic at hardware limits.
v4.2.0 delivers on that — and then goes further.
This release drops CGO from the default build, ships a native Windows syscall driver, and pushes
the assembler hot path through SIMD rewrites and sync.Pool recycling. The result is a
build orchestrator with a mean of 637 ms against Ninja's 1.813 s on a 1000-module
C project. That is a 2.84x lead, sustained across 10 runs with a peak-to-peak range under 60 ms.
Benchmark: ForgeZero vs Ninja, 1000 C Modules
| Tool | Mean | Range (min to max) | Runs |
| --- | --- | --- | --- |
| fz (ForgeZero) | 637.3 ms | 626.6 ms to 685.5 ms | 10 |
| ninja | 1.813 s | 1.800 s to 1.834 s | 10 |
2.84x faster. Consistent. Reproducible.
Hardware: Intel i5-10310U. Cold filesystem cache. No artificial throttling.
The peak-to-peak range on fz is 58.9 ms. On ninja it is 34 ms, but around a mean
nearly three times higher. ForgeZero is not only faster on average: its slowest run
(685.5 ms) still finishes well ahead of ninja's fastest (1.800 s).
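The headline figures can be checked directly from the table. A quick pass over the arithmetic, using only the reported means and ranges:

```latex
\begin{align*}
\text{speedup (means)} &= \frac{1813\ \text{ms}}{637.3\ \text{ms}} \approx 2.84 \\
\text{range}_{\text{fz}} &= 685.5\ \text{ms} - 626.6\ \text{ms} = 58.9\ \text{ms} \\
\text{range}_{\text{ninja}} &= 1834\ \text{ms} - 1800\ \text{ms} = 34\ \text{ms} \\
\text{speedup (fz slowest vs ninja fastest)} &= \frac{1800\ \text{ms}}{685.5\ \text{ms}} \approx 2.63
\end{align*}
```

The last line pairs the least favourable reported runs, so the lead stays above 2.6x across everything in the table.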
The previous release already showed 3.25x against make -j4 at 100 modules. The lead holds
as the module count scales to 1000, this time against a purpose-built parallel tool. That is not a
benchmark trick. That is architecture.
The Big Change: CGO is Gone
v4.1.0 still carried CGO at the entrypoint and in the C plugin loader. Every build that
touched cplugin dragged in the C toolchain, broke cross-compilation, and introduced
non-determinism in the binary ABI.
v4.2.0 removes it from the default build path completely.
The mechanism is GoContext — a pure Go struct that replaces the old C-coupled parameter
passing interface in cplugin. Platform-specific loaders are now split by build tag:
- //go:build linux: Unix dynamic loader via CGO (for users who opt in)
- //go:build windows: native DLL loader via syscall and golang.org/x/sys/windows, zero CGO dependency
- fallback: no-op loader for non-CGO builds, compiles clean on any target
The entrypoint (cmd/fz) no longer imports "C" at all. The binary is now fully
cross-compilable from any host to any target without a C toolchain present.
C module structs were also aligned to 64-byte cache lines to eliminate false sharing
on multi-core dispatch.
Assembler: SIMD, sync.Pool, and a Zero-Allocation Budget
The assembler hot-path in v4.1.0 was fast. In v4.2.0 it operates under a strict
zero-allocation budget on every repeated call.
Three changes made this possible:
SIMD + branchless lookups for whitespace and comment stripping.
The previous implementation used range loops with conditional branches. The rewrite
uses branchless lookup tables and SIMD-width processing to strip comments and normalize
whitespace without a single branch misprediction in the common case.
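A scalar sketch of the lookup-table half of that idea follows. It is simplified on purpose: it drops whitespace entirely instead of normalizing it, it does not handle comments, and it does not show the SIMD-width path. The names keep and stripSpace are illustrative.

```go
package main

import "fmt"

// keep[c] is 1 for bytes that survive and 0 for whitespace that is dropped.
var keep = func() (t [256]uint8) {
	for i := range t {
		t[i] = 1
	}
	for _, c := range []byte(" \t\r\n") {
		t[c] = 0
	}
	return t
}()

// stripSpace copies src into dst without whitespace and returns the new
// length. dst must be at least len(src) bytes long. Every byte is written
// unconditionally; the write index only advances when the table marks the
// byte as kept, so the loop body contains no data-dependent branch.
func stripSpace(dst, src []byte) int {
	n := 0
	for _, c := range src {
		dst[n] = c
		n += int(keep[c])
	}
	return n
}

func main() {
	src := []byte("mov  rax,\t1\n")
	dst := make([]byte, len(src))
	n := stripSpace(dst, src)
	fmt.Printf("%q\n", dst[:n])
}
```

The trade is one extra store per dropped byte in exchange for a loop the branch predictor cannot get wrong on mixed input.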
Parser recycling via sync.Pool.
Each parse call previously allocated a new parser struct. The pool holds pre-warmed
parser instances and recycles them across goroutines. Allocation churn in the assembler
is now flat under sustained parallel load.
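The recycling pattern looks roughly like this. The parser type here is a hypothetical stand-in that only records token offsets; the point is the Get, reset, use, Put cycle around sync.Pool.

```go
package main

import (
	"fmt"
	"sync"
)

// parser is a hypothetical stand-in for the assembler's parser state.
type parser struct {
	starts []int // start offset of each token in the current line
}

// parserPool hands out parsers with a pre-sized scratch slice.
var parserPool = sync.Pool{
	New: func() any { return &parser{starts: make([]int, 0, 32)} },
}

// scan resets the scratch slice and records where each token begins.
func (p *parser) scan(line string) int {
	p.starts = p.starts[:0]
	inTok := false
	for i := 0; i < len(line); i++ {
		sp := line[i] == ' ' || line[i] == '\t'
		if !sp && !inTok {
			p.starts = append(p.starts, i)
		}
		inTok = !sp
	}
	return len(p.starts)
}

// countTokens borrows a parser, uses it, and returns it to the pool.
func countTokens(line string) int {
	p := parserPool.Get().(*parser)
	defer parserPool.Put(p)
	return p.scan(line)
}

func main() {
	lines := []string{"mov rax, 1", "add rax, rbx", "ret"}
	counts := make([]int, len(lines))
	var wg sync.WaitGroup
	for i, l := range lines {
		wg.Add(1)
		go func(i int, l string) {
			defer wg.Done()
			counts[i] = countTokens(l)
		}(i, l)
	}
	wg.Wait()
	fmt.Println(counts)
}
```

sync.Pool may discard idle objects at any garbage collection, so it bounds allocation churn under load rather than guaranteeing a fixed set of instances.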
splitArgs as a parser method with pre-allocated slice recycling.
The argument splitter was a free function that allocated a new slice on every call.
It is now a method on the pooled parser, reusing the backing array across invocations.
The slice is reset, not reallocated.
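A sketch of the reset-not-reallocate pattern, assuming a parser struct with an args field (the real layout is not shown in the post). The returned substrings share memory with the input line, so the only storage that could grow is the slice's backing array.

```go
package main

import "fmt"

// parser is a hypothetical stand-in; only the args scratch slice matters here.
type parser struct {
	args []string
}

// splitArgs splits line on spaces and tabs. It truncates p.args to length
// zero and appends into the existing backing array, so once the array is
// large enough, repeated calls do not allocate. The result is only valid
// until the next call on the same parser.
func (p *parser) splitArgs(line string) []string {
	p.args = p.args[:0]
	start := -1
	for i := 0; i < len(line); i++ {
		if line[i] == ' ' || line[i] == '\t' {
			if start >= 0 {
				p.args = append(p.args, line[start:i])
				start = -1
			}
		} else if start < 0 {
			start = i
		}
	}
	if start >= 0 {
		p.args = append(p.args, line[start:])
	}
	return p.args
}

func main() {
	p := &parser{args: make([]string, 0, 8)}
	fmt.Println(p.splitArgs("-O2 -Wall  -c main.c"))
	fmt.Println(p.splitArgs("-o app"))
}
```

The cost of this design is aliasing: a caller that needs the arguments after the next splitArgs call has to copy them first.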
Compiler flags are parsed once via sync.Once lazy initialization. Repeated calls
to the flag parser cost a single atomic read.
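The lazy-initialization pattern is the standard sync.Once idiom. In this sketch rawFlags and parseCalls are hypothetical; parseCalls exists only to show that the expensive path runs once.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// rawFlags stands in for wherever the flag string comes from (hypothetical).
const rawFlags = "-O2 -Wall -march=native"

var (
	flagsOnce  sync.Once
	flags      []string
	parseCalls int // how often the slow path ran; for illustration only
)

// compilerFlags parses rawFlags on the first call and returns the cached
// slice on every later call.
func compilerFlags() []string {
	flagsOnce.Do(func() {
		parseCalls++
		flags = strings.Fields(rawFlags)
	})
	return flags
}

func main() {
	for i := 0; i < 3; i++ {
		compilerFlags()
	}
	fmt.Println(compilerFlags(), parseCalls)
}
```

Callers share one slice, so they should treat it as read-only.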
FastCopy — the inner loop for instruction stream copying — was rewritten as a
branchless unrolled function using unsafe.Pointer arithmetic. No bounds checks.
No interface dispatch. Direct memory throughput.
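For readers unfamiliar with the style, here is a hedged sketch of an unrolled copy built on unsafe.Pointer arithmetic. It is not the project's FastCopy: the name fastCopy and the 32-byte unroll width are assumptions, the loops still branch on their counters, overlapping buffers are not handled, and it assumes a platform that tolerates unaligned 8-byte loads such as amd64 or arm64.

```go
package main

import (
	"fmt"
	"unsafe"
)

// fastCopy copies min(len(dst), len(src)) bytes from src to dst and returns
// the count. The main loop moves four 8-byte words per iteration through raw
// pointers, so the compiler emits no per-element bounds checks.
func fastCopy(dst, src []byte) int {
	n := len(src)
	if len(dst) < n {
		n = len(dst)
	}
	if n == 0 {
		return 0
	}
	d := unsafe.Pointer(&dst[0])
	s := unsafe.Pointer(&src[0])
	i := 0
	for ; i+32 <= n; i += 32 {
		*(*uint64)(unsafe.Add(d, i)) = *(*uint64)(unsafe.Add(s, i))
		*(*uint64)(unsafe.Add(d, i+8)) = *(*uint64)(unsafe.Add(s, i+8))
		*(*uint64)(unsafe.Add(d, i+16)) = *(*uint64)(unsafe.Add(s, i+16))
		*(*uint64)(unsafe.Add(d, i+24)) = *(*uint64)(unsafe.Add(s, i+24))
	}
	// Tail: fewer than 32 bytes remain, copy them one at a time.
	for ; i < n; i++ {
		*(*byte)(unsafe.Add(d, i)) = *(*byte)(unsafe.Add(s, i))
	}
	return n
}

func main() {
	src := []byte("push rbp; mov rbp, rsp; sub rsp, 32; call helper; leave; ret")
	dst := make([]byte, len(src))
	n := fastCopy(dst, src)
	fmt.Println(n, string(dst[:n]) == string(src))
}
```

The built-in copy is already backed by an optimized memmove, so a hand-rolled loop like this is worth keeping only if a benchmark on the target hardware shows a gain.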
Linker: On-Disk Symbol Cache, No nm
The linker previously invoked the nm utility to resolve symbol tables. nm is a
process fork per object file. At 1000 modules that is 1000 forks, each with exec
overhead, pipe setup, and string parsing.
v4.2.0 introduces a co-located on-disk symbol cache. Symbol data is written alongside
the object file on first build and read directly on subsequent builds. The nm binary
is never invoked in the hot path.
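One way such a sidecar cache could work is sketched below. The ".syms" suffix, the whitespace-separated format, and the modification-time freshness check are assumptions; the post does not describe the real on-disk format or invalidation rule.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// symbolsFor returns the symbols of obj. It reads a sidecar file when one
// exists and is at least as new as the object; otherwise it runs extract
// (the slow path, for example a native object-file reader) and writes the
// sidecar for next time. hit reports whether the cache was used.
func symbolsFor(obj string, extract func(string) ([]string, error)) (syms []string, hit bool, err error) {
	side := obj + ".syms" // hypothetical sidecar name
	oi, err := os.Stat(obj)
	if err != nil {
		return nil, false, err
	}
	if si, serr := os.Stat(side); serr == nil && !si.ModTime().Before(oi.ModTime()) {
		if data, rerr := os.ReadFile(side); rerr == nil {
			return strings.Fields(string(data)), true, nil
		}
	}
	syms, err = extract(obj)
	if err != nil {
		return nil, false, err
	}
	err = os.WriteFile(side, []byte(strings.Join(syms, "\n")), 0o644)
	return syms, false, err
}

// demo creates a throwaway object file and looks its symbols up twice.
func demo() (first, second bool, err error) {
	dir, err := os.MkdirTemp("", "fzsyms")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)
	obj := filepath.Join(dir, "a.o")
	if err = os.WriteFile(obj, []byte("fake object"), 0o644); err != nil {
		return false, false, err
	}
	extract := func(string) ([]string, error) { return []string{"main", "helper"}, nil }
	if _, first, err = symbolsFor(obj, extract); err != nil {
		return first, false, err
	}
	_, second, err = symbolsFor(obj, extract)
	return first, second, err
}

func main() {
	fmt.Println(demo())
}
```

Modification times are a weak freshness signal. A content digest stored in the sidecar would be a stricter check at the cost of hashing the object.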
Object deduplication was simplified from a custom mmap-based structure to a native
Go map. The mmap routines are gone. The map is faster for the deduplication workload
and eliminates an entire category of platform-specific edge cases.
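Map-based deduplication can be as small as the sketch below. Keying on a SHA-256 content digest is an assumption made for the example; the post does not say what key the linker uses.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// object is a hypothetical in-memory view of one object file.
type object struct {
	path string
	data []byte
}

// dedup keeps the first object seen for each distinct content digest and
// preserves input order.
func dedup(objs []object) []object {
	seen := make(map[[sha256.Size]byte]struct{}, len(objs))
	out := make([]object, 0, len(objs))
	for _, o := range objs {
		sum := sha256.Sum256(o.data)
		if _, dup := seen[sum]; dup {
			continue
		}
		seen[sum] = struct{}{}
		out = append(out, o)
	}
	return out
}

func main() {
	objs := []object{
		{"a.o", []byte("alpha")},
		{"a_copy.o", []byte("alpha")},
		{"b.o", []byte("beta")},
	}
	for _, o := range dedup(objs) {
		fmt.Println(o.path)
	}
}
```

A fixed-size array is a valid map key in Go, so the digest can be used directly without converting it to a string.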
Response file generation was refactored to use bufio.Writer with f.Sync removed.
The sync was a correctness hedge that was never needed — the linker already operates
under an ordering guarantee from the task graph.
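A buffered response-file writer without an fsync could look like this. The one-argument-per-line format and the file name are assumptions, and real response files may need quoting for arguments that contain spaces.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
)

// writeResponseFile writes one argument per line through a bufio.Writer.
// Flush hands the bytes to the OS; there is deliberately no f.Sync call.
func writeResponseFile(path string, args []string) (err error) {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer func() {
		// Report a Close error only if nothing failed earlier.
		if cerr := f.Close(); err == nil {
			err = cerr
		}
	}()
	w := bufio.NewWriter(f)
	for _, a := range args {
		if _, err = w.WriteString(a); err != nil {
			return err
		}
		if err = w.WriteByte('\n'); err != nil {
			return err
		}
	}
	return w.Flush()
}

// demo writes a response file into a temp directory and reads it back.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "fzrsp")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	p := filepath.Join(dir, "link.rsp")
	if err := writeResponseFile(p, []string{"a.o", "b.o", "-o", "app"}); err != nil {
		return "", err
	}
	data, err := os.ReadFile(p)
	return string(data), err
}

func main() {
	out, err := demo()
	fmt.Printf("%q %v\n", out, err)
}
```

Skipping Sync is safe for visibility to other processes on the same machine, which is what a linker reading the file needs. It only gives up durability across a crash.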
Builder and Utils: WalkDir, Zero-Copy Loops, Raw Syscalls
The builder's source discovery was rewritten around WalkDir with explicit
filepath.SkipDir support in the custom SYS_GETDENTS64 walker. Directory subtrees
that cannot contain build targets are pruned at the kernel boundary, not in userspace.
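The pruning behaviour can be illustrated with the standard library's filepath.WalkDir, which is a swapped-in technique: the project's walker is described as a custom SYS_GETDENTS64 loop, and this sketch does not reproduce it. The skip list and the ".c" filter are assumptions.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// skip lists directory names assumed never to hold build targets.
var skip = map[string]bool{".git": true, "node_modules": true}

// findSources returns every .c file under root, never descending into a
// directory named in skip.
func findSources(root string) ([]string, error) {
	var out []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			if skip[d.Name()] {
				return filepath.SkipDir // prune the whole subtree
			}
			return nil
		}
		if filepath.Ext(path) == ".c" {
			out = append(out, path)
		}
		return nil
	})
	return out, err
}

// demo builds a small tree and returns the discovered paths relative to it.
func demo() ([]string, error) {
	root, err := os.MkdirTemp("", "fzwalk")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)
	for _, f := range []string{"src/main.c", ".git/hook.c", "README"} {
		p := filepath.Join(root, filepath.FromSlash(f))
		if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
			return nil, err
		}
		if err := os.WriteFile(p, nil, 0o644); err != nil {
			return nil, err
		}
	}
	found, err := findSources(root)
	if err != nil {
		return nil, err
	}
	for i, p := range found {
		rel, _ := filepath.Rel(root, p)
		found[i] = filepath.ToSlash(rel)
	}
	return found, nil
}

func main() {
	fmt.Println(demo())
}
```

Returning filepath.SkipDir from the callback for a directory means its entries are never read, which is the property the custom walker preserves.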
CleanDir was deduplicated — it was being called multiple times per build target in
the previous implementation. Slice allocations in the task dispatch worker were
eliminated; the dispatch loop now operates on pre-allocated task structs with
zero heap writes in the hot path.
Platform-specific file hashing:
- Linux: Direct SYS_OPENAT loops with defer-based cleanup and no os.File allocation
- macOS: Low-level mmap write interfaces, resource management tuned to Darwin's unified buffer cache behavior
- Windows / non-UNIX: Updated fallback stubs that compile clean and behave correctly
Unused pools were purged. Digest-to-string conversions were rewritten to use
stack-allocated byte slices instead of fmt.Sprintf.
What v4.1.0 Established
For context: v4.1.0 shipped the HADES ELF emitter with correct .symtab ordering
(local symbols before global, strict linker compliance), deterministic relocation
calculation for call and jmp offsets, and the AEGIS security layer with TOCTOU
mitigation, atomic SBOM generation, and secureVendorPath symlink validation.
The allocation regression suite and golangci-lint enforcement were introduced in
that release and remain green.
v4.2.0 builds on that foundation without breaking it.
Numbers That Matter
- 2.84x faster than Ninja at 1000 C modules
- 3.25x faster than make -j4 at 100 modules (v4.1.0, still holds)
- ~1.18 GB/s sustained copy throughput on i5-10310U, zero GC interference
- 0 allocs/op, 0 B/op in copyFileHot micro-benchmark
- 0 CGO dependencies in the default build path
- 100% golangci-lint compliance, strict config
Get It
git clone https://github.com/forgezero-cli/forgezero
cd forgezero
go build -o fz ./cmd/fz
No C toolchain required. No network calls at build time. Reproducible output on
amd64 and arm64 across Linux, Windows, and macOS.
Source, benchmarks, and the full commit log are in the repository.
If you are running large C projects and measuring build latency, the numbers are
worth verifying on your own hardware.
ForgeZero is a low-level build orchestrator written in Go. It targets deterministic,
zero-allocation execution for large-scale C and assembly projects.