I Rewrote a Node.js CLI in Rust — 1000x Faster


I've been using Claude Code heavily and kept wondering how much I'm actually spending. There's no built-in way to see total token usage or cost history.

The Problem

I was a happy user of ccusage, a Node.js tool for tracking Claude Code token usage. It worked great — until it didn't.

As my Claude Code usage grew, ccusage started taking 43 seconds to scan my session files. What changed wasn't the tool — it was my data:

$ du -sh ~/.claude/projects
3.4G

$ find ~/.claude/projects -name "*.jsonl" | wc -l
2772

Claude Code auto-deletes sessions older than 30 days, but heavy usage still leaves thousands of JSONL files totaling gigabytes.

An Issue With No Quick Fix

A quick look at GitHub issues confirmed this was a widespread problem:

  • #718 — Suddenly takes minutes
  • #804 — CPU 300%+, memory 2.4GB
  • #821 — 750 files / 4GB → timeout (30s+)

I tried contributing a cache optimization PR, ran benchmarks — and got no meaningful improvement. Here's why.

Why Node.js Couldn't Fix This

ccusage is written in TypeScript. Let's look at what it does:

// Simplified ccusage flow
import { readFile } from 'node:fs/promises';
import { glob } from 'tinyglobby';

const files = await glob(['**/*.jsonl']);

for (const file of files) {
  // Entire file is buffered in memory before any parsing starts
  const content = await readFile(file, 'utf-8');
  for (const line of content.split('\n')) {
    if (!line.trim()) continue; // skip blank lines: JSON.parse('') throws
    const parsed = JSON.parse(line);
    // process...
  }
}

Glob files → read each one entirely into memory → split lines → JSON.parse each line. Four compounding bottlenecks:

1. JSON.parse Is Sequential

V8 has steadily improved JSON performance:

  • V8 v7.6 (2019) — Optimized JSON.parse memory allocation
  • V8 v13.8 (Chrome 138, 2025) — SIMD-optimized JSON.stringify

But JSON.parse still processes one byte at a time. It scans input character-by-character, classifies tokens via lookup tables, and builds objects incrementally. The fundamental architecture is sequential.

Compare this to simdjson, which uses SIMD instructions to process 32–64 bytes in parallel:

JSON.parse:  [a][b][c][d]…[e][f][g][h] → 8 operations (sequential)
simdjson:    [a,b,c,d,e,f,g,h]         → 1 operation (SIMD)

simdjson works in two stages:

  • Stage 1 (Structural Discovery): SIMD scans 64 bytes at once, extracting the positions of structural characters ({, }, [, ], :, ,) and string quotes as bitmasks — branchless
  • Stage 2 (Value Materialization): Builds actual values from discovered structure

This separation lets Stage 1 run without branches, maximizing CPU pipeline utilization — achieving gigabytes per second of JSON throughput.

A standalone Node.js binding for simdjson exists, but it hasn't been actively maintained since 2021. While Node.js internally uses simdjson for some operations, it's not exposed as a replacement for JSON.parse. And even if JSON parsing got faster, the single-threaded and GC problems below would remain.

2. Memory Pressure From readFile

const content = await readFile(filePath, 'utf-8');

readFile loads entire files into memory. For 2,772 files, each one gets:

  1. Loaded into a Buffer
  2. UTF-8 decoded into a String
  3. Split on '\n' into yet another array of strings

You could stream with createReadStream plus readline to lower peak memory, but each line still materializes as a heap-allocated string and JSON.parse still runs sequentially on the main thread, so it doesn't fundamentally help.

3. libuv Thread Pool Ceiling

Node.js file I/O runs on libuv's thread pool, not the event loop:

Node.js I/O Architecture:
┌──────────────────┐
│ Event Loop       │ ← single-threaded (JS execution + callbacks)
└────────┬─────────┘
         │
┌────────▼─────────┐
│ libuv threadpool │ ← default: 4 threads (fs operations)
│ [1] [2] [3] [4]  │
└──────────────────┘

The default is 4 threads. You can increase it with UV_THREADPOOL_SIZE, so I tested it:

UV_THREADPOOL_SIZE=4 (default)  → 43.4s, 76% CPU
UV_THREADPOOL_SIZE=64           → 42.2s, 80% CPU  (~3% improvement)
UV_THREADPOOL_SIZE=128          → 33.6s, 100% CPU (~22% improvement)

32x more threads → only 22% faster. Even with CPU pegged at 100%, the run still took 33 seconds. The bottleneck isn't I/O — it's JSON.parse running on the single main thread.

You could use Worker Threads for parallel parsing, but transferring data between threads requires serialization — you'd pay the parsing cost twice.

4. GC Overhead

Parsing 3GB of JSONL creates millions of temporary objects. V8's GC has to clean them up, causing stop-the-world pauses:

GC trace results (3GB / 2,227 files):
Total GC events: 504
Peak heap memory: 378MB
Longest Major GC pause: 135ms

# Major GC (Mark-Compact) log excerpt
605 ms:  Mark-Compact 195.3 → 163.4 MB, 135.79ms pause
1051 ms: Mark-Compact 273.0 → 239.9 MB
1378 ms: Mark-Compact 378.9 → 225.5 MB
1650 ms: Mark-Compact 390.2 → 210.9 MB
1886 ms: Mark-Compact 375.3 → 161.7 MB

V8's Orinoco GC uses incremental marking and concurrent sweeping to minimize pauses, but with this volume of data, 504 GC events in a single run — roughly one every 100ms — adds up.

Summary

Every fix I tried hit a wall:

  • Sequential JSON.parse → simdjson Node.js binding? Unmaintained since 2021.
  • libuv 4 threads → UV_THREADPOOL_SIZE=128? 22% improvement ceiling.
  • Single-threaded parsing → Worker Threads? Serialization overhead.
  • GC overhead → 504 GC events, 135ms max pause. No fix available.
  • Combined: cache optimization PR? No meaningful improvement.

This wasn't a single bottleneck — it was multiple structural limitations compounding.

Why Rust?

Just rewrite it in another language — sure, but which one?

  • Go — Concurrent GC with low pause times, but no zero-copy SIMD JSON ecosystem comparable to Rust's simd-json
  • C++ — No GC, full SIMD and threading support. But no memory safety for handling thousands of files, and cross-compilation and npm distribution are both painful.
  • Rust — No GC + memory safety + simd-json/rayon ecosystem + cargo cross for 5 platforms + npm distribution via binary wrapper. Everything I needed.

And honestly — I wanted to learn Rust properly, and this was the perfect project for it.

The Rewrite

I rewrote the tool with the stack below. Since I had used ccusage extensively, I treated it as the reference implementation:

  • JSON parsing: JSON.parse (sequential) → simd-json (SIMD)
  • File discovery: tinyglobby → glob
  • Parallelism: libuv 4 threads → rayon (all cores)
  • Memory: V8 GC → Ownership + zero-copy
  • UI: Terminal output → ratatui TUI dashboard

simd-json: Zero-Copy Parsing

// Zero-copy JSON parsing - no heap allocation for strings
let data: ClaudeJsonLine = simd_json::from_slice(&mut line_bytes)?;

Rust's simd-json is a port of simdjson. Pass a &mut [u8] to from_slice and it parses in place, returning string slices that borrow from the original buffer instead of allocating new Strings. This is zero-copy parsing.
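In practice that means the deserialization target borrows from the input buffer. A minimal sketch, with illustrative field names (not toktrack's actual schema):

use serde::Deserialize;

// Borrowed fields point into the original byte buffer: no string copies.
// Field names here are illustrative, not the real session schema.
#[derive(Deserialize)]
struct ClaudeJsonLine<'a> {
    #[serde(borrow)]
    model: Option<&'a str>,
    input_tokens: Option<u64>,
    output_tokens: Option<u64>,
}

fn parse_line(buf: &mut [u8]) -> Result<ClaudeJsonLine<'_>, simd_json::Error> {
    // simd-json mutates the buffer in place while parsing, hence &mut
    simd_json::from_slice(buf)
}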

rayon: Embarrassingly Simple Parallelism

// Parallel file processing: just change .iter() to .par_iter()
use rayon::prelude::*;

let entries: Vec<UsageEntry> = files
    .par_iter()
    .flat_map(|f| parse_file(f)) // parse_file returns a Vec<UsageEntry>
    .collect();

rayon makes data parallelism trivial. Change .iter() to .par_iter() and it automatically distributes work across all CPU cores using work-stealing.

The key difference from Node.js: each Rust thread independently reads and parses files, then collects results. No serialization overhead between threads. The "parallel parsing requires serialization" dilemma simply doesn't exist.
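For illustration, here is what such a per-file worker might look like — a minimal sketch assuming a flat schema with illustrative field names (toktrack's real parsers are richer):

use serde::Deserialize;
use std::path::Path;

// Hypothetical per-line record; real session lines are more nested.
#[derive(Deserialize)]
struct UsageEntry {
    #[serde(default)]
    input_tokens: u64,
    #[serde(default)]
    output_tokens: u64,
}

// Runs entirely on one rayon worker thread: read, split lines, parse.
// Nothing crosses a thread boundary until the final collect().
fn parse_file(path: &Path) -> Vec<UsageEntry> {
    let Ok(content) = std::fs::read_to_string(path) else {
        return Vec::new(); // unreadable file contributes nothing
    };
    content
        .lines()
        .filter(|line| !line.trim().is_empty())
        .filter_map(|line| {
            // simd-json parses in place, so hand it a mutable copy of the line
            let mut buf = line.as_bytes().to_vec();
            simd_json::from_slice::<UsageEntry>(&mut buf).ok()
        })
        .collect()
}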

Multi-CLI Support

Toktrack parses session data from three AI coding CLIs:

  • Claude Code — JSONL format, ~/.claude/projects/
  • Codex CLI — JSONL format, ~/.codex/sessions/
  • Gemini CLI — JSON format, ~/.gemini/tmp/*/chats/

Each parser implements a shared CLIParser trait, and all files are processed in a single parallel pass. Support for OpenCode and GLM CLI is planned.
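As a rough sketch of the shape (illustrative only; toktrack's actual trait and method names may differ):

use std::path::{Path, PathBuf};

// Shared, CLI-agnostic usage record (fields elided for brevity).
struct UsageEntry;

// Illustrative trait shape. Sync lets parsers be shared across rayon threads.
trait CLIParser: Sync {
    // Root directory to scan, e.g. ~/.claude/projects
    fn session_root(&self) -> PathBuf;
    // Parse one session file into zero or more usage entries.
    fn parse_file(&self, path: &Path) -> Vec<UsageEntry>;
}

struct ClaudeParser;

impl CLIParser for ClaudeParser {
    fn session_root(&self) -> PathBuf {
        PathBuf::from(std::env::var("HOME").unwrap_or_default())
            .join(".claude/projects")
    }
    fn parse_file(&self, _path: &Path) -> Vec<UsageEntry> {
        Vec::new() // JSONL parsing as in the sketch above
    }
}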

Architecture

Cold path (first run): Full glob scan → parallel SIMD parsing → build cache → aggregate.

Warm path (cached): Load cached summaries → recompute only today (past dates served from cache) → merge → aggregate.
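In code, the dispatch between the two paths might look roughly like this (hypothetical; all types and helpers here are stand-ins):

use std::collections::BTreeMap;

// Stand-in types: per-day summaries keyed by "YYYY-MM-DD".
#[derive(Clone, Default)]
struct DailySummary { input_tokens: u64, output_tokens: u64 }
type Summaries = BTreeMap<String, DailySummary>;

// Hypothetical helpers, stubbed for this sketch.
fn load_cache() -> Option<Summaries> { None }     // warm if Some
fn full_scan() -> Summaries { Summaries::new() }  // cold: glob + parallel parse
fn scan_day(_day: &str) -> DailySummary { DailySummary::default() }

fn collect(today: &str) -> Summaries {
    match load_cache() {
        // Warm path: past days come straight from cache; only today is recomputed
        Some(mut cached) => {
            cached.insert(today.to_string(), scan_day(today));
            cached
        }
        // Cold path: no cache yet, so parse everything and build it
        None => full_scan(),
    }
}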

Data Preservation

The cache isn't just for performance. AI CLIs silently delete session data — Claude Code defaults to cleanupPeriodDays: 30, wiping JSONL files after 30 days. When those files are gone, your token usage and cost history go with them.

toktrack's daily cache solves this. Past dates are immutable — once a day is summarized, the result is never modified. Even if Claude Code deletes the original session files a month later, your cost history remains intact in ~/.toktrack/cache/.
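A sketch of the write-once rule (the per-day file layout under ~/.toktrack/cache/ is an assumption based on the description above):

use std::path::PathBuf;

// Hypothetical per-day cache file, e.g. ~/.toktrack/cache/2025-06-01.json
fn cache_path(day: &str) -> PathBuf {
    PathBuf::from(std::env::var("HOME").unwrap_or_default())
        .join(".toktrack/cache")
        .join(format!("{day}.json"))
}

// Past days are write-once; callers only ever rewrite today's entry.
fn persist_day(day: &str, summary_json: &str) -> std::io::Result<()> {
    let path = cache_path(day);
    if path.exists() {
        return Ok(()); // immutable: a summarized past day is never modified
    }
    std::fs::create_dir_all(path.parent().unwrap())?;
    std::fs::write(path, summary_json)
}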

Results

  • ccusage (Node.js): ~43s
  • toktrack (Rust): ~0.04s — ~1000x faster
Mode                    Time
Cold start (no cache)   ~1.0s
Warm start (cached)     ~0.04s

Throughput: ~3 GiB/s with rayon + simd-json

Measured on Apple Silicon (M-series), 2,772 JSONL files, 3.4 GB total.

What I Learned

Node.js is excellent for many things, but parsing gigabytes of JSON across thousands of files exposes structural limits that no amount of optimization can fix:

  • JSON.parse is fundamentally sequential
  • Scaling libuv threads hits diminishing returns fast
  • Worker Thread parallelism comes with serialization tax
  • GC pressure grows with data volume

Rust's combination of SIMD JSON parsing, zero-cost parallelism, and no GC made the speedup possible (roughly 40x on a cold start, ~1000x with the warm cache) — not by clever algorithms, but by removing the bottlenecks entirely.


toktrack is open source. Feedback and contributions welcome!
