How ANDARTIS tamed the resource footprint of local AI, bridged independent data cores, and optimized for the standard consumer Mac.
In my last post, From Manifesto to Metal, I shared the foundational blueprint of ANDARTIS. I detailed how we rejected the rented cognition of the cloud, chose the unorthodox path of marrying Laravel & NativePHP with Apple’s MLX engine, and built a persistent background daemon to keep neural weights hot in unified memory.
It was a beautiful blueprint. But as any craftsman will tell you: blueprints don't compile on their own.
When philosophy meets consumer hardware, you are immediately confronted with the reality of resource constraints. If a private AI utility requires a top-tier Mac Studio with 128GB of unified memory or a multi-GPU workstation to run, it is not a truly democratic tool. It is merely a different kind of luxury.
To build a sovereign tool for the rogue researcher, the local clinician, the independent writer, and the developer, it must run comfortably on standard, everyday hardware—the ubiquitous 16GB RAM MacBook. Over the last few weeks, we took a step back, looked at the raw metal, and learned how to build a highly optimized, resource-conscious cortex that honors these constraints.
1. The Jet Engine and the Cognitive Compiler
In early local AI experiments, developers often fall into a lazy trap: waking up a heavy Large Language Model for every single document in a directory to extract metadata.
On a 16GB machine, doing this across fifty or five hundred files is a recipe for a frozen UI, saturated unified memory, and fans that sound like a jet engine preparing for takeoff. The user is forced to choose between the privacy of local computation and the responsiveness of their machine.
We realized we were using a high-powered neural network to do what classic, deterministic rules could execute in microseconds. The neural network didn't need to run continuously; it just needed to write the rules once.
This is the philosophy behind the Cognitive Compiler:
- Compile-Time (Layout Analysis): When data is synced, the local model wakes up briefly. It scans two or three sample documents, analyzes their layout structure, and generates a set of regex, table-cell, and anchor-offset rules. These rules are saved into a tiny JSON blueprint.
- Run-Time (High-Speed Ingestion): For the remaining hundreds of files, the local model stays completely asleep. A lightweight, CPU-only engine applies the compiled blueprints. Processing a file drops from seconds of neural compute to milliseconds of regex matching.
- JIT Self-Healing: If a new document format is introduced and the rules yield incomplete metadata, the system automatically fires a single neural extraction pass to heal and re-compile the blueprint on the fly.
By shifting from per-file neural extraction at run time to compiling layout rules once, we sharply reduced memory use during ingestion, so the machine stays cool, responsive, and ready for other work.
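To make the compile-once, apply-many idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the ANDARTIS implementation: the blueprint format (field name mapped to a regex), the function names, and the `stub_compile` stand-in for the neural pass are all hypothetical.

```python
import json
import re
from typing import Callable, Dict, Optional, Tuple

Blueprint = Dict[str, str]  # hypothetical format: field name -> regex with one capture group


def stub_compile(sample: str) -> Blueprint:
    """Stand-in for the neural layout pass. A real system would ask the local model."""
    if "Invoice No" in sample:
        return {"invoice": r"Invoice No[:\s]+(\w+)", "total": r"Total[:\s]+([\d.]+)"}
    return {"invoice": r"INV#(\w+)", "total": r"AMOUNT=([\d.]+)"}


def apply_blueprint(blueprint: Blueprint, text: str) -> Dict[str, Optional[str]]:
    """Run-time path: CPU-only regex matching, no model involved."""
    fields = {}
    for name, pattern in blueprint.items():
        match = re.search(pattern, text)
        fields[name] = match.group(1) if match else None
    return fields


def ingest(
    text: str, blueprint: Blueprint, recompile: Callable[[str], Blueprint]
) -> Tuple[Dict[str, Optional[str]], Blueprint]:
    """Apply the blueprint; if any field is missing, trigger one recompile pass."""
    fields = apply_blueprint(blueprint, text)
    if any(value is None for value in fields.values()):
        blueprint = recompile(text)  # self-healing: one extra compile, then retry
        fields = apply_blueprint(blueprint, text)
    return fields, blueprint


doc_a = "Invoice No: A17\nTotal: 42.50"
doc_b = "INV#B99 AMOUNT=10.00"  # a new layout the first blueprint does not cover

# Compile once from a sample, then persist the rules as a small JSON document.
blueprint = json.loads(json.dumps(stub_compile(doc_a)))

fields_a, blueprint = ingest(doc_a, blueprint, stub_compile)
fields_b, blueprint = ingest(doc_b, blueprint, stub_compile)
```

In this sketch the healed blueprint simply replaces the old one. A fuller version would probably keep one blueprint per detected layout so that older formats continue to match.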
2. Math, Chunks, and Grounded Accuracy
Another challenge of local models is their inherent weakness with arithmetic and exact data counts.
Because standard RAG architectures divide long documents into overlapping chunks for semantic search, asking a local model a statistical question—like summarizing the most common records or calculating totals—often results in the model counting text chunks rather than actual files. A single file mentioning a topic multiple times becomes artificially amplified, leading to inaccurate summaries.
Furthermore, forcing a quantized local model to parse hundreds of rows of raw data to calculate sums or averages pushes its limited context window to the brink and saturates unified memory.
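The chunk-amplification effect is easy to reproduce. The toy Python example below uses made-up retrieval hits (the file names and topics are purely illustrative) and compares counting by chunk with counting by document.

```python
from collections import Counter

# Hypothetical retrieval hits: (source file, chunk index, topic found in the chunk)
chunks = [
    ("a.pdf", 0, "diabetes"),
    ("a.pdf", 1, "diabetes"),
    ("a.pdf", 2, "diabetes"),
    ("b.pdf", 0, "asthma"),
    ("c.pdf", 0, "asthma"),
]

# Naive approach: every chunk counts, so one long file dominates.
chunk_counts = Counter(topic for _, _, topic in chunks)

# Document-level approach: each (file, topic) pair counts once.
unique_pairs = {(path, topic) for path, _, topic in chunks}
doc_counts = Counter(topic for _, topic in unique_pairs)

top_by_chunk = chunk_counts.most_common(1)[0][0]
top_by_document = doc_counts.most_common(1)[0][0]
```

Counting chunks makes "diabetes" look like the most common topic because one file mentions it three times, while counting documents shows that "asthma" appears in more files.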
To solve this, we decoupled the neural conversational layer from a symbolic data-analysis layer:
- Document-Level Registry: We separated chunk-level semantic search from document-level metadata registration. Each file's properties are written once, as a single row per document, to a registry.
- Symbolic Math Offloading: When a user queries metrics or aggregations, the symbolic engine queries the local registry directly, executing precise database math programmatically.
- Context Synthesis: Instead of feeding raw, unparsed data to the local model, we hand it the pre-computed figures. The model acts as a writer rather than a calculator, generating fluent natural-language responses from statistics computed exactly from the registry.
This separation of concerns keeps the model's memory footprint small and makes quantitative answers as exact as the registry itself, without exceeding the 16GB memory budget.
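As a rough sketch of this split, the Python example below uses the standard-library `sqlite3` module as the registry. The table layout, column names, sample rows, and prompt wording are assumptions made for illustration; the actual ANDARTIS schema may look quite different.

```python
import sqlite3

# Hypothetical registry: one row per document, never per chunk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE registry (path TEXT PRIMARY KEY, category TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO registry VALUES (?, ?, ?)",
    [
        ("inv/001.pdf", "hardware", 120.0),
        ("inv/002.pdf", "hardware", 80.0),
        ("inv/003.pdf", "software", 50.0),
    ],
)


def aggregate(db: sqlite3.Connection):
    """Symbolic layer: the database does the arithmetic, not the model."""
    query = (
        "SELECT category, COUNT(*), SUM(amount) "
        "FROM registry GROUP BY category ORDER BY category"
    )
    return db.execute(query).fetchall()


def build_prompt(question: str, stats) -> str:
    """Hand the model finished numbers so it only has to phrase the answer."""
    lines = [
        f"- {category}: {count} documents, total {total:.2f}"
        for category, count, total in stats
    ]
    return (
        "Answer using only these pre-computed figures; do not recalculate.\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}"
    )


stats = aggregate(conn)
prompt = build_prompt("How much did we spend per category?", stats)
```

The prompt stays a few lines long no matter how many rows the registry holds, which is what keeps the context window, and therefore memory, small.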
3. Federated Cross-Reasoning: Bridging the Islands
A truly private system shouldn't force users to merge all their files into a single, monolithic database. An independent thinker organizes data in folders: research papers in one, medical histories in another, and financial documents in a third.
Merging these into a single database increases memory usage, risks data corruption, and ruins search precision, so the files must remain in their isolated cores. But how do we compare and reason across them?
We introduced Federated Cross-Reasoning:
- Parallel Local Search: When a user asks a comparative query across multiple folders, the system executes parallel semantic searches across each independent database.
- Context Compilation: The orchestrator retrieves the most relevant chunks from each target core, formats them into a clean markdown structure, and tags each chunk with the core it came from.
- Single-Pass Synthesis: This compiled, multi-source context is sent to the primary conversational model for a single synthesis pass.
This allows the system to compare records from entirely different directories without merging files or inflating memory usage.
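The three steps above can be sketched in Python as follows. This is a simplified illustration: the `CORES` dictionary stands in for separate on-disk databases, and the word-overlap ranking in `search_core` is a deliberately crude substitute for real semantic search. All names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

# Hypothetical isolated cores; in practice each would be its own database.
CORES = {
    "research": [
        "Study links sleep loss to higher blood pressure.",
        "Notes on protein folding.",
    ],
    "medical": [
        "Blood pressure reading 140/90 on 3 March.",
        "Vaccination record.",
    ],
}


def search_core(name: str, query: str, top_k: int = 1) -> Tuple[str, List[str]]:
    """Stand-in for per-core semantic search: rank chunks by shared words."""
    terms = set(query.lower().split())

    def overlap(chunk: str) -> int:
        words = set(chunk.lower().replace(".", "").split())
        return len(terms & words)

    ranked = sorted(CORES[name], key=overlap, reverse=True)
    return name, ranked[:top_k]


def compile_context(query: str, core_names: List[str]) -> str:
    """Search every core in parallel, then tag each result with its source."""
    with ThreadPoolExecutor() as pool:
        # pool.map returns results in the same order as core_names.
        results = list(pool.map(lambda name: search_core(name, query), core_names))
    sections = []
    for name, chunks in results:
        body = "\n".join(f"- {chunk}" for chunk in chunks)
        sections.append(f"## Source: {name}\n{body}")
    return "\n\n".join(sections)


context = compile_context("blood pressure", ["research", "medical"])
```

The resulting `context` string is what would be handed to the conversational model for the single synthesis pass. The cores themselves are only ever read, never merged.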
The Art of the Constrained Machine
Building software for local AI is a design discipline of restraint. The cloud tempts developers to be wasteful, throwing infinite memory and expensive API calls at unoptimized code.
On local metal, wastefulness is punished immediately by lagging frames and hot keyboards. Building for a 16GB MacBook forces you to write better software. It demands that you separate neural logic from symbolic computation, compile layouts once, and federate search across small, clean, isolated databases.
The resulting tool is lighter, faster, completely air-gapped, and accessible to anyone with a standard laptop. That is what democratic technology looks like.
Forge the wisdom. Keep it local. Never look back.
Sneak Peek --> https://docs.andartis.it