I Built a Local-First AI Desktop Knowledge Base — Here's What I Learned

May 30 • 6 min read

After building docnest-ai — a hybrid RAG engine for Python — the next logical question was: what does a great end-user app built on top of it actually look like?

That question led me to build Knovex: a local-first, AI-powered desktop knowledge base that runs entirely on your machine. No cloud uploads. No subscriptions. No data leakage. Just drop in your documents, ask questions, and learn.

This post covers the architecture decisions, the problems I hit, and the interesting technical bits. If you want to skip straight to the app: tailorgunjan93.github.io/knovex

Why build a desktop app in 2026?

Every AI knowledge tool I tried had the same deal: your documents leave your machine. Legal contracts, research notes, personal journals — all uploaded to some company's inference server. The privacy trade-off felt wrong.

The local-first principle changes the threat model entirely:

Your files never leave your machine unless you choose to enable cloud features
The app works fully offline (use Ollama for a zero-network setup)
API keys are encrypted at rest with Fernet (AES-128 in CBC mode plus an HMAC-SHA256 integrity check), readable only by your OS account

The constraint also forced better engineering. When you can't lean on a cloud backend, you have to make the local stack actually fast.

Architecture overview

Knovex is a fully decoupled tri-layer app:

┌─────────────────────────────────────────┐
│  Electron 33 (desktop shell)            │
│  ┌─────────────────────────────────┐    │
│  │  React 18 + MUI v6 + TypeScript │    │
│  │  TanStack Query v5 + Zustand    │    │
│  └──────────────┬──────────────────┘    │
└─────────────────│───────────────────────┘
                  │  REST + SSE  (localhost:8765)
┌─────────────────▼───────────────────────┐
│  FastAPI + Python 3.11                  │
│  docnest-ai (hybrid RAG engine)         │
│  SQLite WAL + FTS5                      │
│  LiteLLM (multi-provider LLM bridge)    │
└─────────────────────────────────────────┘

The frontend is a pure API consumer — it knows nothing about RAG, embeddings, or LLMs. All intelligence lives in the Python backend. This made it very easy to swap out components independently.

Why Electron?

Electron gets a bad reputation, but for a privacy-first desktop app it's the right call:

Single installer ships backend binary (PyInstaller) + frontend + Electron in one .exe/.dmg/.AppImage
The backend is spawned as a child process and communicates over localhost
Window state, tray, native OS file dialogs — all handled properly
Cross-platform with one codebase

The binary is ~85-92 MB depending on platform. Not tiny, but users get zero setup — no Python, no Node, no CLI gymnastics.

The RAG engine: docnest-ai

Rather than naive chunking (split every 512 chars → embed → hope), docnest-ai runs a 6-stage normalization pipeline:

1. Structure extraction: reads heading hierarchy, tables, lists (Docling or PyMuPDF)
2. Section assignment: every heading becomes a navigable §section
3. Table normalization: { caption, headers, rows[] } JSON, never loses column context
4. Section summarization: LLM called once per document
5. Document intelligence: summary, key numbers, insights
6. Embedding + quantize: BM25 keywords + float16 vectors

Stages 1–3 and 6 run locally at zero LLM cost. Stages 4–5 call an LLM once per document at ingest time. Every future query benefits from that upfront investment for free.
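To make the table-normalization stage concrete, here is a minimal sketch of producing the { caption, headers, rows[] } shape. It assumes rows arrive as lists of cell strings and that each row is stored as a header-to-cell mapping; `normalize_table` is a hypothetical helper for illustration, not docnest-ai's actual API.

```python
import json


def normalize_table(caption: str, headers: list[str], raw_rows: list[list[str]]) -> dict:
    """Return a {caption, headers, rows[]} dict where every cell keeps its column name."""
    rows = []
    for raw in raw_rows:
        # Pad short rows so a missing trailing cell does not shift column alignment.
        padded = raw + [""] * (len(headers) - len(raw))
        rows.append(dict(zip(headers, padded)))
    return {"caption": caption, "headers": headers, "rows": rows}


table = normalize_table(
    "Quarterly revenue",
    ["Quarter", "Revenue"],
    [["Q1", "120"], ["Q2"]],  # second row is missing a cell on purpose
)
print(json.dumps(table, indent=2))
```

Keying each cell by its header is one way to keep column context attached when a single row is later retrieved on its own.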

Query resolution: five layers

The query engine tries cheaper layers first before escalating:

Layer	Mechanism	Tokens	Latency
L0	Pre-computed summary/insights	0	< 1ms
L1	BM25 + cosine → navigate to §section	0	< 20ms
L2	Section-scoped LLM	~300	1–3s
L3	Multi-section synthesis	~900	2–5s
L4	Full-document fallback	~4000+	5–15s

In practice, L0+L1 answer ~70% of real-world questions at zero LLM cost. You only pay when you genuinely need the model.
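The escalation logic can be sketched roughly as follows. This is an illustration only: the `Answer` type, the confidence threshold of 0.7, and the stub layers are assumptions made for the example, not the engine's real interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Answer:
    text: str
    layer: str
    confidence: float


# A layer takes the query and returns an Answer, or None if it cannot help.
Layer = Callable[[str], Optional[Answer]]


def resolve(query: str, layers: list[Layer], threshold: float = 0.7) -> Optional[Answer]:
    """Try layers in order (cheapest first) and stop at the first confident answer."""
    best: Optional[Answer] = None
    for layer in layers:
        answer = layer(query)
        if answer is None:
            continue
        if answer.confidence >= threshold:
            return answer
        if best is None or answer.confidence > best.confidence:
            best = answer
    return best  # nothing cleared the threshold: fall back to the best attempt


# Stub layers standing in for L0 (precomputed summary) and L2 (section-scoped LLM).
def l0_summary(query: str) -> Optional[Answer]:
    if "summary" in query.lower():
        return Answer("Precomputed summary text", "L0", 0.95)
    return None


def l2_section_llm(query: str) -> Optional[Answer]:
    return Answer(f"LLM answer for: {query}", "L2", 0.8)


first = resolve("Give me a summary", [l0_summary, l2_section_llm])
second = resolve("Why did revenue drop?", [l0_summary, l2_section_llm])
print(first.layer, second.layer)
```

Because the layers are ordered by cost, the LLM-backed stub only runs when the zero-token layer declines or is not confident enough.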

Semantic search (v0.7.0+)

For Knovex v0.7.0 I added hybrid semantic search on top:

# ONNX-based local embeddings (all-MiniLM-L6-v2, ~45 MB, one-time download)
# OR OpenAI text-embedding-3-small via API

# Results fused with Reciprocal Rank Fusion (RRF):
# score = 1/(k + rank_fts5) + 1/(k + rank_ann)

RRF handles the case where BM25 and the semantic model disagree: one document ranks high on keyword match, another on conceptual similarity, and both surface near the top of the fused list. The fused ranking tends to beat either ranker individually.
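A minimal sketch of the RRF formula above, using only the standard library. The constant k=60 is a commonly used default and an assumption here, since the post does not state the value Knovex uses; the document IDs are made up for the example.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document absent from a ranking simply contributes nothing for it.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


fts5_ranking = ["doc_a", "doc_b", "doc_c"]  # keyword (FTS5/BM25) order
ann_ranking = ["doc_b", "doc_d", "doc_a"]   # semantic (ANN) order
fused = rrf_fuse([fts5_ranking, ann_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that appear in both lists (doc_a, doc_b) outrank those found by only one ranker, which is the behaviour the fusion step is meant to provide.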

Average query latency on a typical KB is still sub-millisecond for the FTS5 path and ~0.9s end-to-end including the LLM call on an M-series Mac.

Learn Mode: turning documents into learning sessions

This was the most fun feature to build. The idea: instead of just answering questions, the app can generate structured learning content from any document or topic.

Nine formats, all streaming via SSE:

Quiz — interactive MCQ with XP rewards per question
Flashcards — spaced repetition with interval scheduling
Mind Map — collapsible JSON tree rendered with D3
Timeline — chronological events extracted from the text
Guided — step-by-step walkthrough via GuidedViewer
Story — narrative markdown retelling of the content
ELI5 — explain like I'm five
Brainstorm — creative connections and lateral ideas
Speed Learn — bullet-point summary for fast review

The JSON formats (Quiz, Flashcards, Mind Map, Timeline) use a two-phase approach: LLM generates structured JSON → parse → re-stream the parsed results. Text formats (Story, ELI5, etc.) stream in real-time token by token.
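The two-phase approach for JSON formats might look something like the sketch below. The token source, the `questions` payload shape, and the SSE event names are all assumptions made for illustration; only the buffer, parse, then re-stream structure comes from the description above.

```python
import asyncio
import json
from collections.abc import AsyncIterator


def sse(event: str, data: object) -> str:
    """Format one Server-Sent Events message."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def fake_llm_stream() -> AsyncIterator[str]:
    """Stand-in for an LLM token stream that emits one JSON document in fragments."""
    fragments = [
        '{"questions": [{"q": "What is WAL?",',
        ' "answer": "Write-ahead log"},',
        ' {"q": "What is FTS5?", "answer": "Full-text search"}]}',
    ]
    for fragment in fragments:
        yield fragment


async def quiz_events(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    # Phase 1: partial JSON is not parseable, so buffer the full response first.
    buffer = "".join([chunk async for chunk in tokens])
    try:
        payload = json.loads(buffer)
    except json.JSONDecodeError as exc:
        yield sse("error", {"message": str(exc)})
        return
    # Phase 2: re-stream the parsed items one event at a time.
    for item in payload["questions"]:
        yield sse("question", item)
    yield sse("done", {})


async def main() -> list[str]:
    return [event async for event in quiz_events(fake_llm_stream())]


events = asyncio.run(main())
print("".join(events))
```

The trade-off is that the client sees nothing until the model finishes, but every event it does receive is already valid, structured data.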

Gamification

I added XP, level progression (10 tiers), daily streaks, and achievement badges. This was partly experimental — does adding game mechanics to a local productivity tool actually improve usage? Anecdotally yes: the streak counter creates a small daily habit pull.

The Progress Page (v0.8.0) shows:

26-week activity heatmap (sessions per day, colour-coded)
Learning velocity chart (sessions/week + active days/week dual-axis)
XP level with badge
Week-over-week session delta

Design patterns used throughout

Adapter pattern (anti-corruption layer)

Every third-party dependency sits behind a swappable interface:

# backend/adapters/llm_client.py
from collections.abc import AsyncIterator
from typing import Protocol

class ILLMClient(Protocol):
    """Interface every LLM backend must satisfy."""
    async def complete(self, messages: list[dict], **kwargs) -> str: ...
    async def stream(self, messages: list[dict], **kwargs) -> AsyncIterator[str]: ...

class LiteLLMAdapter(ILLMClient):
    """Wraps litellm; the only place litellm is imported."""
    ...

class StubLLMClient(ILLMClient):
    """Used in tests; zero network calls."""
    ...

Same pattern for: HTTP client (httpx), PDF parser (PyMuPDF / Docling), web search (DuckDuckGo / Serper / Brave), paragraph parser (python-docx).

This made testing painless — all 61 E2E tests mock at the adapter boundary.

Strategy + plugin registration for parsers

_PARSERS: dict[str, type[IFileParser]] = {}

def register_parser(ext: str):
    def decorator(cls):
        _PARSERS[ext] = cls
        return cls
    return decorator

@register_parser(".pdf")
class PDFParser(IFileParser): ...

@register_parser(".docx")
class DocxParser(IFileParser): ...

Adding a new file format means writing one class and adding one decorator. Zero changes to the orchestration layer.

EventBus for decoupled notifications

# In-process typed EventBus, no external dependencies
from dataclasses import dataclass

bus = EventBus()

@dataclass
class FileIngested:
    file_id: str
    kb_id: str
    chunk_count: int

bus.emit_typed(FileIngested(file_id=..., kb_id=..., chunk_count=42))

The watcher service (which detects stale/missing files) communicates with the KB service through events rather than direct calls. This kept the service layer clean.
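A typed in-process bus of this kind can be very small. The sketch below is one possible shape, not the Knovex implementation: `emit_typed` matches the call shown above, while `subscribe` and the dispatch-by-class design are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable


class EventBus:
    """In-process bus that dispatches on the event's class."""

    def __init__(self) -> None:
        self._handlers: dict[type, list[Callable[[Any], None]]] = {}

    def subscribe(self, event_type: type, handler: Callable[[Any], None]) -> None:
        self._handlers.setdefault(event_type, []).append(handler)

    def emit_typed(self, event: Any) -> None:
        # Copy the list so handlers may subscribe or unsubscribe while we iterate.
        for handler in list(self._handlers.get(type(event), [])):
            handler(event)


@dataclass
class FileIngested:
    file_id: str
    kb_id: str
    chunk_count: int


bus = EventBus()
seen: list[FileIngested] = []
bus.subscribe(FileIngested, seen.append)  # e.g. the KB service reacting to ingestion
bus.emit_typed(FileIngested(file_id="f1", kb_id="kb1", chunk_count=42))
print(seen[0].chunk_count)  # 42
```

The emitter only knows the event class, so the watcher and the KB service never import each other.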

Challenges worth noting

SQLite WAL mode + concurrent async writes — FastAPI runs async, and SQLite's WAL mode handles readers well but writers queue. I had to add retry logic with exponential backoff for the ingestion pipeline, which can run as a background task while chat is active.
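A minimal, synchronous sketch of retry with exponential backoff around a SQLite write, using only the standard library. The helper name, retry count, and delays are assumptions for the example; the real backend is async and may structure this differently.

```python
import random
import sqlite3
import time


def execute_with_retry(conn, sql, params=(), retries=5, base_delay=0.05):
    """Run one write statement, retrying while another writer holds the lock."""
    for attempt in range(retries + 1):
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(sql, params)
            return
        except sqlite3.OperationalError as exc:
            message = str(exc).lower()
            contended = "locked" in message or "busy" in message
            if not contended or attempt == retries:
                raise
            # Exponential backoff plus jitter so competing writers do not retry in lockstep.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, body TEXT)")
execute_with_retry(conn, "INSERT INTO chunks (body) VALUES (?)", ("hello",))
count = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
print(count)  # 1
```

Only lock contention is retried; any other `OperationalError` (a typo in the SQL, a missing table) is raised immediately so real bugs are not hidden behind the backoff.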

PyInstaller + Python 3.11 + ONNX — packaging the ONNX runtime into a PyInstaller binary was the most painful part of the v0.7.0 release. The model weights need to be bundled correctly, paths resolved at runtime via sys._MEIPASS. Worth documenting if you're going down this path.
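The usual pattern for resolving bundled files looks roughly like this. `sys._MEIPASS` is set by PyInstaller's bootloader in a frozen app; the `resource_path` helper, the development fallback to the current directory, and the model path are assumptions for illustration.

```python
import sys
from pathlib import Path


def resource_path(relative: str) -> Path:
    """Resolve a bundled resource both in a PyInstaller build and in development."""
    # In a frozen build, PyInstaller exposes the directory holding bundled data here.
    base = getattr(sys, "_MEIPASS", None)
    if base is None:
        # Development fallback (assumption: the app is started from the project root).
        base = Path.cwd()
    return Path(base) / relative


# Hypothetical location of the bundled ONNX model.
model_path = resource_path("models/all-MiniLM-L6-v2/model.onnx")
print(model_path)
```

Routing every resource lookup through one helper keeps the frozen-versus-development difference out of the rest of the codebase.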

SSE streaming through Electron's IPC — Electron's fetch API handles SSE properly, but the preload script needed explicit keep-alive handling to prevent the renderer from killing long-running streams during Learn Mode generation (which can take 10–30 seconds for complex documents).

Windows SmartScreen — unsigned NSIS installers get flagged. Adding instructions to the download page for "More info → Run anyway" reduced support questions significantly.

What's next

Phase 2 of Knovex moves toward cloud + organisation features:

Cloud Portal — web admin for org key management and user management
3 deployment modes — Personal (own keys) / Organisation (portal-managed) / Self-hosted (Docker)
LangGraph agent orchestration — beyond single-turn Q&A
Visual workflow builder — chain operations on your KB
Mobile app — React Native, same backend API
Plugin/connector marketplace — Notion, Confluence, GitHub, etc.

Try it

App: tailorgunjan93.github.io/knovex — free one-click installer for Windows, macOS, Linux
GitHub: github.com/tailorgunjan93/knovex
RAG engine: pip install docnest-ai

MIT licensed. v0.10.0 is stable with 61 E2E tests passing.

Happy to answer questions about any part of the stack in the comments.

16 Comments

Next Big Creative · Answer 1 · 2026-06-01T02:06:56+0000

Next Big Creative • May 31

Love the local first approach. Have you noticed users caring more about privacy or performance?

Gunjan Tailor • Jun 2

@[Next Big Creative] Great question — honestly, it's been both, but for different user profiles. The privacy-first crowd comes in knowing exactly what they want: zero cloud, no telemetry, full control. They're usually handling legal docs, research notes, or internal company data. Performance is secondary for them — they'll wait 2 seconds for a query if they know the data never left their machine.
The performance-first users discover the privacy benefits after the fact. They come for the ~1ms FTS5 latency or the offline Ollama setup, and then realise cloud RAG tools were slowing them down AND sending their data somewhere. That's actually been the more interesting conversion — people who didn't start out caring about privacy, and now do.
So the short answer: privacy brings them in, performance makes them stay.

Ken W. Alger (verified) · Answer 2 · 2026-06-01T20:19:50+0000

Ken W. Alger (verified) • Jun 1

Gunjan, this is a phenomenal write-up and an incredible masterclass in local-first systems architecture.

What I love most about Knovex is that it completely rejects the 'Digital Attic' trap. Most desktop AI tools just blindly dump messy, uncurated markdown and PDF snippets into an embeddings base and pray that semantic search can figure it out at runtime. All that does is saddle the user with a massive, recurring Prose Tax—burning local compute and ballooning context windows just to re-explain structural history the system should already know.

Your 6-stage normalization pipeline in docnest-ai is the exact right antidote. By investing in structure, section assignment, and table normalization at the ingestion boundary, you've built a true Forensic Ingestor. Pre-paying that precision so that L0 and L1 can resolve 70% of queries at zero token cost is pure engineering maturity.

Did you hit any specific edge cases when normalizing highly irregular table structures into that clean JSON schema before embedding them? That's usually where the deterministic layer gets tested the hardest.

Gunjan Tailor • Jun 9

Ken, this genuinely made my week. You've articulated the thesis better than I did — "knowing exactly when not to call an LLM" is the whole bet. The Observer's Tax framing came straight out of watching token bills balloon on queries that were really just "sum this column."

Since you're clearly tuned into this: I split the deterministic extraction layer out as its own open-source engine — DocNest (https://github.com/tailorgunjan93/docnest). Knovex is the desktop app on top, but I pulled the engine apart on purpose so the table-extraction / §-section / zero-token factual path could be reused and inspected independently. Would genuinely value your eyes on where the deterministic/LLM boundary should sit — you're exactly the person I'd want poking holes in it.

Ken W. Alger (verified) • Jun 9

@[Gunjan Tailor] huge congratulations on pulling DocNest out as its own open-source engine! That is a massive contribution to the local-first community. Decoupling the deterministic extraction layer from the UI application layer is exactly how we move past monolithic, opaque AI tooling and start building verifiable data pipelines.

I would absolutely love to take a deep dive into the repository and poke at the architecture.

Regarding your question on where that deterministic/LLM boundary should sit, my immediate instinct leans toward a strict Gated Egress model. If we treat the deterministic layer as the absolute authority, the boundary should be defined by three clean criteria:

Factual and Aggregative Primacy: If a query can be resolved by a structured relational lookup, an AST structural path, or a mathematical operation (like your table column summation), the LLM should never be invoked. The deterministic code layer executes, returns the clean snippet, and completely bypasses the model. This is your zero-token fast path.
The LLM as an Ephemeral Narrator: The only time the boundary shifts to the probabilistic layer is when the user explicitly requests synthesis, semantic cross-referencing, or natural language translation of the data. Even then, the LLM is only handed the evidence bundle that your deterministic code layer has already extracted, validated, and frozen.
The Lineage Gate: The moment data crosses from your deterministic engine over to the LLM narrator, a non-repudiable audit trace should be generated. The engine should bind the raw data hash, the extraction metadata, and the exact slice passed to the model into a single receipt so that downstream drift can always be debugged.

I'm going to pull down the docnest repo tonight and look at how you're handling the structural extraction states under the hood. You've built an incredible foundation here. Let's map out exactly where the code ends and the narrator begins.

Gunjan Tailor • Jun 14

@[Ken W. Alger] The Gated Egress framing maps almost exactly onto what I landed on. L0 (FTS5 keyword) and L1 (precomputed table extraction) are your Factual Primacy layer — if either resolves the query above confidence threshold, the LLM never fires. L2 (ANN + LLM) only triggers when L0/L1 return below-confidence or the query is clearly synthesizing across sections. The Lineage Gate piece is where I'd love your eyes most — each answer currently carries its section_id chain but I haven't formalized a confidence gate on that leg yet. Open an issue on the DocNest repo with your three criteria and let's stress-test the boundary together.

Hussein Mahdi · Answer 3 · 2026-06-03T15:21:32+0000

Hussein Mahdi • Jun 3

Beautiful, it holds promise if implemented and improved correctly while preserving privacy.

Gunjan Tailor • Jun 5

Thanks Hussein — that's exactly the bar I'm holding it to. "Privacy-preserving" only counts if it survives real use, so the rule is simple: nothing leaves the machine unless you flip a switch, and even then you see exactly what's sent. The hard part isn't the promise, it's keeping it true as features grow — but that constraint is what keeps the design honest. More to come.
