Private, Offline RAG in Python with Ollama: A Self-Hosted, GDPR-Friendly Build


Jun 8 • 11 min read

Most RAG tutorials quietly ship your documents to someone else's cloud. You paste a contract, a customer list, or an internal wiki into an embedding API, and that data now lives on a server you don't control, in a jurisdiction you didn't choose. For a side project, fine. For a company handling personal or confidential data under GDPR, that's a problem you have to answer for.

This guide builds the opposite: a private, offline RAG pipeline in Python with Ollama, where the language model, the embeddings, and the vector store all run on your own machine. Nothing leaves the box. By the end you'll have one small, framework-light script that answers questions over your local documents — fully self-hosted, with a clear path to running it on an EU VPS for data residency.

No LangChain, no managed vector database, no API keys. Just ollama, numpy, and roughly 80 lines of Python you can actually read.

Why run RAG locally? Privacy, cost, and data sovereignty

There are three honest reasons to keep RAG on-premises, and "it's cool" isn't one of them.

Privacy and data residency. The moment you send text to a hosted embedding or chat API, you've handed it to a third-party processor, which under GDPR calls for a data processing agreement (Article 28). If their servers sit outside the EEA, you've also triggered the international-transfer rules (Chapter V). Keeping everything local sidesteps that entire category of risk: there is no transfer because there is no third party. For regulated data (health, finance, HR, legal), "the data physically never left our infrastructure" is the cleanest compliance story you can tell.

Cost at scale. Per-token pricing is great until you're embedding a few million chunks and re-embedding them every time you tweak your chunking. Local inference turns a variable, usage-coupled bill into a fixed hardware cost. If you run a steady workload, the math flips in favor of self-hosting fast.

Control and longevity. No surprise deprecations, no rate limits mid-refactor, no model swapped out from under you. The weights you downloaded today will still run next year, offline, on a plane, in an air-gapped network.

Local LLMs stopped being a hobbyist curiosity somewhere in 2025. Ollama made pulling and running open-weight models a one-liner, and privacy is now consistently cited as a leading barrier to enterprise cloud-LLM adoption — which is exactly the gap local inference fills. If you want the structured, end-to-end version of this topic, there's a full course on local LLMs, privacy, and self-hosting (Romanian-language) that goes deeper than one tutorial can; I'll link a few relevant ones as we hit each topic.

What we're building

One Python script, rag.py, that:

Reads .txt and .md files from a local ./docs folder.
Splits them into overlapping chunks.
Embeds each chunk with a local embedding model via Ollama.
Stores the vectors in memory with NumPy (no external DB).
On each question, embeds the query, finds the most similar chunks by cosine similarity, and feeds them to a local chat model as grounded context.

That's the whole RAG loop: retrieve, augment, generate — with every step running on your hardware. RAG itself (chunking, embeddings, retrieval quality) is a deep subject; if you want the theory behind the choices here, there's a dedicated RAG course that covers it properly.

Step 1 — Install Ollama and pull your models (and check the license)

Install Ollama from ollama.com, then pull one chat model and one embedding model:

# A capable, permissively licensed chat model (Apache 2.0)
ollama pull qwen3:8b

# A fast, high-quality embedding model
ollama pull nomic-embed-text

Verify the daemon is up:

ollama list
ollama run qwen3:8b "Say hello in one short sentence."
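
The same check can be scripted, which is handy before a long indexing run. The sketch below assumes Ollama's default address (http://localhost:11434) and its /api/tags endpoint, which lists the pulled models. The helper names are illustrative, and a model name without a tag is assumed to be stored as `<name>:latest`.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default local address
REQUIRED = ["qwen3:8b", "nomic-embed-text"]


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """Ask the local Ollama server which models are pulled (GET /api/tags)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


def missing_models(tags: dict, required: list[str]) -> list[str]:
    """Return the required models that do not appear in an /api/tags payload."""
    installed = {m["name"] for m in tags.get("models", [])}
    missing = []
    for name in required:
        # Assumption: a name without an explicit tag is stored as "<name>:latest".
        full = name if ":" in name else f"{name}:latest"
        if full not in installed:
            missing.append(name)
    return missing


# Typical use (needs a running Ollama server, so it is left as a comment):
#   gaps = missing_models(fetch_tags(), REQUIRED)
#   print("All models present." if not gaps else f"Still to pull: {gaps}")
```

If the server is not running, `fetch_tags` raises an `OSError` subclass (connection refused or timeout), which is a reasonable signal to start the daemon before indexing.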

A word on licenses, because this is where people get burned. "Open-weight" does not mean "do whatever you want commercially." As of 2026:

Qwen3 ships under Apache 2.0 for its open variants — genuinely permissive, commercial use included.
Gemma 3 uses a custom Google license (the Gemma Terms of Use), not Apache 2.0, with its own use restrictions.
Llama 4 uses Meta's community license, which is custom and has conditions (notably around very large deployments and naming).

The takeaway: before you ship anything commercial, open the model's card and read its actual license. Don't trust a blog post (including this one) as your legal source — model terms change. We default to Qwen3 here precisely because Apache 2.0 keeps the legal surface small.

Install the Python dependencies:

pip install ollama numpy

Step 2 — The one-line migration: Ollama's OpenAI-compatible API

Here's the detail that makes adoption painless. Ollama exposes an OpenAI-compatible endpoint, so if you already have code written against the OpenAI SDK, you don't rewrite it — you repoint it:

from openai import OpenAI

# The ONLY changes from a normal OpenAI setup:
client = OpenAI(
    base_url="http://localhost:11434/v1",  # your local Ollama
    api_key="ollama",                       # required by the SDK, but ignored locally
)

resp = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Explain RAG in one sentence."}],
)
print(resp.choices[0].message.content)

Same chat.completions.create, same response shape — but it runs on localhost and never makes an outbound request. That's your migration path for an existing app: swap base_url and api_key, point model at a local model, done. Building production apps against these SDKs (and abstracting over multiple providers cleanly) is its own discipline — covered in the course on building AI apps with Python and the OpenAI/Anthropic SDKs.
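
One way to keep that repoint out of the application code is to read the endpoint from configuration. A minimal sketch, assuming hypothetical LLM_BASE_URL, LLM_API_KEY and LLM_MODEL environment variables (these names are invented for illustration, not something Ollama or the OpenAI SDK defines):

```python
import os

# Defaults point at a local Ollama server, matching the snippet above.
LOCAL_DEFAULTS = {
    "base_url": "http://localhost:11434/v1",
    "api_key": "ollama",  # placeholder; the SDK wants a value, Ollama ignores it
    "model": "qwen3:8b",
}


def llm_settings(env=None) -> dict:
    """Build client settings from the environment, falling back to local Ollama."""
    env = os.environ if env is None else env
    return {
        "base_url": env.get("LLM_BASE_URL", LOCAL_DEFAULTS["base_url"]),
        "api_key": env.get("LLM_API_KEY", LOCAL_DEFAULTS["api_key"]),
        "model": env.get("LLM_MODEL", LOCAL_DEFAULTS["model"]),
    }


def is_local(settings: dict) -> bool:
    """True when the endpoint host is this machine, so no text leaves the box."""
    host = settings["base_url"].split("//", 1)[-1].split("/", 1)[0].split(":", 1)[0]
    return host in {"localhost", "127.0.0.1"}
```

The result plugs straight into the client: `OpenAI(base_url=s["base_url"], api_key=s["api_key"])` with `s = llm_settings()`. Calling `is_local(s)` at startup is a cheap guard for a deployment that is supposed to stay offline.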

For the rest of this tutorial I'll use the native ollama Python library instead, because it's the most direct way to call a local model — but everything below works through the OpenAI-compatible client too.

Step 3 — Local embeddings and an in-memory vector store

Embeddings turn text into vectors so we can measure semantic similarity. With Ollama, that's one call to a local model — nomic-embed-text returns 768-dimensional vectors and never phones home.

import ollama
import numpy as np

EMBED_MODEL = "nomic-embed-text"

def embed_texts(texts: list[str]) -> np.ndarray:
    """Embed a batch of strings with a local Ollama model."""
    resp = ollama.embed(model=EMBED_MODEL, input=texts)
    return np.array(resp["embeddings"], dtype=np.float32)
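
For a large corpus, sending every chunk in a single call can mean a very large request. A hedged sketch of batching follows: the batch size of 64 is an arbitrary starting point, and `embed_fn` stands in for any function that maps a list of strings to a list of vectors (for example, a thin wrapper around the ollama.embed call above).

```python
from typing import Callable, Iterator


def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 64,  # assumption: tune for your hardware
) -> list[list[float]]:
    """Embed `texts` batch by batch, preserving input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors
```

With the function above, a call could look like `embed_in_batches(chunks, lambda b: ollama.embed(model=EMBED_MODEL, input=b)["embeddings"])`, wrapped in `np.array(..., dtype=np.float32)` to get the same matrix as before.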

Now the documents. We load every .txt and .md file under ./docs, then split each into overlapping character chunks. Overlap matters: text that straddles a chunk boundary shows up in both neighbouring chunks, so a short sentence cut by one boundary still appears whole in the other chunk. A sentence longer than the overlap can still be split.

import os
import glob

DOCS_DIR = "docs"
CHUNK_SIZE = 800       # characters per chunk
CHUNK_OVERLAP = 100    # characters shared between neighbours

def load_documents(folder: str) -> list[tuple[str, str]]:
    paths = (
        glob.glob(os.path.join(folder, "**", "*.txt"), recursive=True)
        + glob.glob(os.path.join(folder, "**", "*.md"), recursive=True)
    )
    docs = []
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            docs.append((path, f.read()))
    return docs

def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    chunks, start = [], 0
    step = max(1, size - overlap)  # guard against overlap >= size
    while start < len(text):
        piece = text[start:start + size].strip()
        if piece:
            chunks.append(piece)
        if start + size >= len(text):
            break  # window reached the end; another step would only repeat the overlap
        start += step
    return chunks
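
To see where the boundaries fall, the sketch below reports character offsets instead of text. It uses the same size and overlap defaults; `chunk_spans` is a name invented for this illustration, and stopping as soon as a window reaches the end of the text is a choice of this sketch that avoids a final chunk made only of overlap.

```python
def chunk_spans(length: int, size: int = 800, overlap: int = 100) -> list[tuple[int, int]]:
    """Return (start, end) character offsets for a sliding window over `length` chars."""
    step = max(1, size - overlap)
    spans, start = [], 0
    while start < length:
        end = min(start + size, length)
        spans.append((start, end))
        if end == length:
            break  # sketch-specific choice: stop once a window reaches the end
        start += step
    return spans


for start, end in chunk_spans(2000):
    print(f"chars {start:>4} to {end:>4}")
```

For a 2,000-character document this gives three windows: 0 to 800, 700 to 1500, and 1400 to 2000. Each neighbouring pair shares 100 characters.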

We build the index once at startup: every chunk plus its source path and its vector, all held in memory.

def build_index(folder: str):
    docs = load_documents(folder)
    if not docs:
        raise SystemExit(
            f"No .txt or .md files found in ./{folder} — add some documents first."
        )
    chunks, sources = [], []
    for path, text in docs:
        for ch in chunk_text(text):
            chunks.append(ch)
            sources.append(path)
    embeddings = embed_texts(chunks)  # shape: (n_chunks, 768)
    return chunks, sources, embeddings

No Pinecone, no Chroma server, no network. For a knowledge base up to tens of thousands of chunks, NumPy in memory is genuinely enough — and it keeps the "nothing leaves the machine" guarantee absolute.
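
A quick back-of-the-envelope check supports that claim. The sketch assumes a dense float32 matrix (4 bytes per value) with 768 dimensions, as produced above; the function name is illustrative.

```python
def index_bytes(n_chunks: int, dim: int = 768, bytes_per_value: int = 4) -> int:
    """Raw size of a dense float32 embedding matrix of shape (n_chunks, dim)."""
    return n_chunks * dim * bytes_per_value


for n in (1_000, 50_000, 1_000_000):
    print(f"{n:>9,} chunks -> {index_bytes(n) / 1e6:,.1f} MB")
```

At 50,000 chunks the vectors take about 154 MB, which fits easily in RAM. The chunk text itself adds to that (up to 800 characters per chunk), and at a million chunks the matrix passes 3 GB, which is roughly where a persistent local vector store starts to look attractive.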

Step 4 — Retrieval and the RAG loop

Retrieval is cosine similarity between the query vector and every chunk vector. We normalize both sides, take the dot product, and grab the top matches. The small + 1e-10 avoids a division-by-zero if a vector is all zeros.

TOP_K = 4

def cosine_search(query: str, chunks, sources, embeddings, k: int = TOP_K):
    q = embed_texts([query])[0]
    q = q / (np.linalg.norm(q) + 1e-10)
    mat = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-10)
    scores = mat @ q                         # cosine similarity for every chunk
    top = np.argsort(scores)[::-1][:k]       # indices of the k best
    return [(chunks[i], sources[i], float(scores[i])) for i in top]
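
To make the arithmetic concrete, here is the same ranking on three toy 2-D vectors in plain Python. It is an illustration only (the helper names are invented here); the NumPy version above is what the script uses.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-10)  # epsilon guards against zero vectors


def top_k(query: list[float], vectors: list[list[float]], k: int = 2) -> list[tuple[int, float]]:
    """Return (index, score) for the k vectors most similar to the query."""
    scored = sorted(((cosine(query, v), i) for i, v in enumerate(vectors)), reverse=True)
    return [(i, score) for score, i in scored[:k]]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.1]  # points almost along the first axis
for index, score in top_k(query, vectors):
    print(index, round(score, 3))
```

The query ranks vector 0 first and vector 2 second, because direction is what counts: vector 2 is longer than vector 0, yet it points further away from the query.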

Then we assemble a grounded prompt and ask the local chat model. The instruction to answer only from the context, and to admit when the answer isn't there, makes the model far less likely to hallucinate confidently over your data, though it does not rule it out.

CHAT_MODEL = "qwen3:8b"

def answer(query: str, retrieved) -> str:
    context = "\n\n".join(f"[Source: {src}]\n{text}" for text, src, _ in retrieved)
    prompt = (
        "You are a precise assistant. Answer the question using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    resp = ollama.chat(
        model=CHAT_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["message"]["content"]

Finally, a tiny REPL to tie it together:

if __name__ == "__main__":
    print("Indexing local documents...")
    chunks, sources, embeddings = build_index(DOCS_DIR)
    print(f"Indexed {len(chunks)} chunks. Ask a question (Ctrl+C to exit).")
    while True:
        try:
            question = input("\n> ").strip()
        except (EOFError, KeyboardInterrupt):
            print()
            break
        if not question:
            continue
        hits = cosine_search(question, chunks, sources, embeddings)
        print("\n" + answer(question, hits))
        print("\nSources: " + ", ".join(sorted({src for _, src, _ in hits})))

Drop a few .md or .txt files into ./docs, run python rag.py, and you have a working private assistant over your own documents. The full script is just the blocks above, in order, in one file.

Note on "thinking" models. Some reasoning models (Qwen3 among them) can emit internal <think> traces. If you see them in the output, either pick a non-reasoning model or pass the library's think=False option to ollama.chat. For a clean Q&A bot, plain instruction-following models are simplest.
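
If a trace still slips through, it can be removed after the fact. A small sketch, assuming the trace is wrapped in literal `<think>...</think>` tags; `strip_think` is a helper invented for this example, not part of the ollama library.

```python
import re

# Assumption: the reasoning trace is wrapped in literal <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks and surrounding whitespace from model output."""
    return THINK_BLOCK.sub("", text).strip()
```

In the script, that would mean returning `strip_think(resp["message"]["content"])` from `answer`.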

Step 5 — Self-hosting on an EU VPS: residency and a DPIA mindset

Running on your laptop proves the concept. For a team, you'll want it on a server you control — ideally one in your own jurisdiction. A GPU VPS hosted in the EU gives you the same offline pipeline with data residency baked in: the documents, the embeddings, and the model all stay on infrastructure inside the EEA.

A few practical, privacy-by-design habits when you productionize this:

Data minimisation. Only index what the assistant actually needs. Don't bulk-load an entire shared drive because you can.
Access control. Put the service behind authentication and network restrictions; "internal only" should mean it.
Retention and logs. Decide whether you log prompts at all. If you do, treat those logs as the sensitive data they are.
DPIA mindset. For higher-risk processing, a Data Protection Impact Assessment is how you reason through this before launch, not after an incident.
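
As one illustration of the retention habit, a log entry can describe a query without storing its text. This is a sketch under assumptions: the field names are invented, and whether file paths count as sensitive depends on your documents.

```python
import json
import time


def audit_record(question: str, sources: list[str]) -> dict:
    """Build a log entry that describes a query without storing its text."""
    unique_sources = sorted(set(sources))
    return {
        "ts": int(time.time()),           # when the query happened
        "question_chars": len(question),  # size only, never the content
        "n_sources": len(unique_sources), # how many distinct files were used
        "sources": unique_sources,        # drop this field if paths are sensitive
    }


line = json.dumps(audit_record("What is Jane Doe's salary?", ["docs/hr.md", "docs/hr.md"]))
print(line)
```

The record is enough to monitor usage and debug retrieval, while the question itself never reaches disk.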

Not legal advice. This section is engineering guidance, not a legal opinion. For the actual obligations that apply to your processing, consult your DPO and the primary sources — the GDPR text on EUR-Lex and your national supervisory authority. Architecture can support compliance, but it doesn't replace a proper assessment.

Choosing your model: Qwen3 vs Gemma 3 vs Llama 4

Your two levers are capability and license. A rough 2026 map for the chat model:

Model	License (verify the card)	Notes
Qwen3	Apache 2.0 (open variants)	Strong all-rounder; the safe default for commercial use.
Gemma 3	Custom Google license (not Apache)	Capable and efficient, but read the Gemma terms before shipping.
Llama 4	Meta community license (custom)	Solid ecosystem; conditions apply at large scale.

For embeddings, nomic-embed-text is a great default (768-dim, fast, good quality). If you need stronger multilingual retrieval, bge-m3 (1024-dim) is worth a look; mxbai-embed-large (also 1024-dim) is a larger option aimed mainly at English text. Whichever you pick, keep your embedding model consistent between indexing and querying, because vectors from different models aren't comparable.
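
A cheap way to enforce that consistency is to store the model name and vector size next to the index and compare them at query time. The metadata layout and function name below are hypothetical, a sketch rather than anything the script above already does.

```python
def compat_problems(index_meta: dict, query_model: str, query_dim: int) -> list[str]:
    """List the reasons a query embedding cannot be compared with the stored vectors."""
    problems = []
    if index_meta["model"] != query_model:
        problems.append(
            f"index built with {index_meta['model']!r}, query uses {query_model!r}"
        )
    if index_meta["dim"] != query_dim:
        problems.append(
            f"index vectors have {index_meta['dim']} dims, query has {query_dim}"
        )
    return problems


# Hypothetical metadata saved alongside the vectors at indexing time.
meta = {"model": "nomic-embed-text", "dim": 768}
print(compat_problems(meta, "bge-m3", 1024))
```

An empty list means the index and the query agree; anything else is a signal to rebuild the index before trusting the similarity scores.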

The honest rule on size: bigger isn't automatically better for RAG. Because the answer is grounded in retrieved context, a well-tuned 7–8B model often matches a much larger one on factual Q&A, while running comfortably on modest hardware.

Hardware sizing and GGUF quantization

Ollama serves models in GGUF format, quantized to shrink memory use with minimal quality loss. The quant level you pick is the main RAM/VRAM lever:

q4_K_M: the balanced default. Roughly 4-bit; a 7–8B model comes to around 5 GB and runs on most modern machines. Start here.
q8_0: near-lossless, a bit under twice the footprint of q4. Use it when you have the memory and want maximum fidelity.
fp16: unquantized 16-bit half precision; largest, rarely necessary for local RAG.

Rules of thumb, not guarantees: a 7–8B model at q4_K_M is comfortable on a machine with ~8 GB of free RAM/VRAM; step up to a 13–14B model and you'll want noticeably more. Treat these as starting points and watch actual usage — your context length and concurrency matter too.
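
The sizing logic is simple multiplication: parameters times bits per weight, divided by eight. The bits-per-weight figures below are approximate averages and should be read as assumptions; real GGUF files vary by architecture because some tensors are kept at higher precision.

```python
# Assumption: approximate average bits per weight for common GGUF quant levels.
BITS_PER_WEIGHT = {"q4_K_M": 4.85, "q8_0": 8.5, "fp16": 16.0}


def weights_gb(n_params: float, quant: str) -> float:
    """Rough size of the model weights in GB (10^9 bytes)."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


for quant in BITS_PER_WEIGHT:
    print(f"8B at {quant}: ~{weights_gb(8e9, quant):.1f} GB")
```

Under these assumptions an 8B model needs just under 5 GB at q4_K_M, about 8.5 GB at q8_0, and 16 GB at fp16. That covers the weights only; the KV cache for your context window and any concurrent requests come on top.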

Limitations, and when local is the wrong call

Self-hosting isn't a free lunch. Be honest about the trade-offs:

Top-end quality. The very largest frontier models still lead on the hardest reasoning. For grounded document Q&A the gap is small; for open-ended reasoning it's real.
Ops burden. You now own uptime, updates, GPU drivers, and scaling. A managed API hands that to someone else.
Throughput. One GPU serves limited concurrency. High-traffic, bursty workloads can be cheaper and simpler on a metered API.

A reasonable heuristic: local for sensitive, steady, or offline workloads; hosted for spiky, public-data, or cutting-edge-reasoning needs. Many teams run both and route by data sensitivity. Cost-optimising LLM usage in production — caching, routing, batching across local and hosted — is a topic of its own, covered in the advanced LLM integration course.

One more thing before you trust a local model in production: measure it. Build a small evaluation set of question/expected-answer pairs and score retrieval and answer quality before and after any change (different model, chunk size, or quant). "It felt fine in the demo" is how regressions ship. If you want a rigorous approach to that, see the course on LLM evals for production.
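
A starting point can be very small. The sketch below scores retrieval only: for each question, did the expected source file appear among the retrieved sources? The function name and the shape of the evaluation set are assumptions made for this example, and answer quality still needs its own check.

```python
from typing import Callable


def hit_rate_at_k(
    eval_set: list[tuple[str, str]],
    retrieve: Callable[[str], list[str]],
) -> float:
    """Fraction of questions whose expected source appears in the retrieved sources."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = sum(1 for question, expected in eval_set if expected in retrieve(question))
    return hits / len(eval_set)


# Toy run with a stand-in retriever, so the sketch works without Ollama.
fake_results = {"q1": ["a.md", "b.md"], "q2": ["c.md"]}
score = hit_rate_at_k([("q1", "a.md"), ("q2", "z.md")], lambda q: fake_results[q])
print(f"hit rate: {score:.2f}")
```

Against the script above, `retrieve` could be `lambda q: [src for _, src, _ in cosine_search(q, chunks, sources, embeddings)]`. Run it before and after changing the chunk size, model, or quant level and compare the two numbers.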

Conclusion

You now have a complete, private RAG pipeline that runs entirely on your own hardware: local embeddings, an in-memory vector store, local generation, and a one-line path to migrate an existing OpenAI app to it. No documents leave the machine, no per-token bill, no third-party processor in the data path — which is exactly what makes it a defensible choice under GDPR.

The pattern scales further than this script: swap the in-memory store for a persistent local vector DB when you outgrow NumPy, put it behind auth on an EU VPS, and add an eval harness so you can change models with confidence. But the core stays the same — retrieve, augment, generate, all on infrastructure you control.

Start small: drop ten documents into ./docs, run it, and ask it something only your files know. The first correct, fully-offline answer is a good feeling. From there, the only question is how much of your private knowledge you want to make queryable — privately.

5 Comments


Gunjan Tailor · Answer 1 · 2026-06-08T10:50:53+0000

This is one of the cleanest local-RAG writeups I've seen — the OpenAI-compatible one-line repoint and the NumPy-in-memory store are exactly right for tens of thousands of chunks. One thing I'd push on: the 800-char overlapping chunker is still the weak link, even fully offline. Overlap saves a sentence that straddles a boundary, but a table row like "45.2% | Q3 | Europe" still gets flattened away from its headers, and nomic-embed will happily embed that noise. For text-heavy docs it's fine; for contracts and financials the failure mode is silent and confident. Your "answer only from context, else say you don't know" instruction is the right backstop though — that refusal habit prevents most confident-wrong answers. I've been building an ingestion engine (docnest) around preserving structure before chunking for exactly this reason. Bookmarking — the GDPR/DPIA framing is genuinely useful.

Ken W. Alger (verified) · Answer 2 · 2026-06-08T14:04:11+0000

This is an exceptional, pragmatic guide to building a truly private, localized knowledge base. You’ve cleanly dismantled the assumption that robust RAG requires sacrificing data custody to external cloud APIs.

What's particularly compelling about your architecture is how closely it mirrors the core design patterns outlined in the open-source Sovereign Systems Specification. Your build is a textbook implementation of a couple of critical patterns defined in the spec:

The Ingestion Boundary & Data Custody: By handling file extraction, text chunking, and embedding entirely within a local Python runtime before persisting to an isolated instance of ChromaDB, your pipeline aligns perfectly with the spec's requirements for secure data custody. You’re ensuring that the "authority-bearing layer" of your data remains completely immune to external telemetry or third-party training cycles.
Deterministic Context Isolation: Passing the retrieved context chunks to a locally running Ollama instance satisfies the spec's pattern for isolated execution contexts. Many teams build "private" RAG but still pipe the final augmented prompt to a cloud LLM, which introduces a quiet data-lineage leak. Your architecture maintains strict physical and semantic boundaries from file upload to token generation.

For anyone trying to navigate GDPR, HIPAA, or strict IP protections, this self-hosted layout is the foundational floor. Thanks for putting together such a clear blueprint.

galian • Jun 8

@[Ken W. Alger] Thank you for such a detailed and thoughtful analysis. I genuinely appreciate the connection you've made to the Sovereign Systems Specification.

What initially motivated this architecture was not performance or cost optimization, but a simple question: "How can we give organizations the benefits of AI without forcing them to surrender control of their data?" The deeper I went into the problem, the more it became clear that many so-called "private AI" solutions still contain hidden trust assumptions that break true data sovereignty.

Your observations around the Ingestion Boundary and Data Custody are especially important. In my experience, many teams focus heavily on the model while overlooking the ingestion pipeline, even though that's where some of the most critical privacy decisions are made. Once documents leave a controlled environment during extraction, embedding, or indexing, it becomes difficult to make strong guarantees about governance and compliance.

I also completely agree regarding Deterministic Context Isolation. One of the reasons I chose the Ollama-compatible architecture was precisely to avoid the subtle data-lineage issues that appear when retrieval happens locally but generation is delegated to an external provider. For many use cases, that final step is where the privacy promise quietly falls apart.

What I find particularly exciting is that these architectures are no longer reserved for large enterprises. With modern local models, efficient embedding systems, and lightweight vector databases, it's now entirely feasible for SMEs, law firms, healthcare providers, accounting practices, and public-sector organizations to deploy sovereign AI capabilities on their own infrastructure.

Ultimately, I believe the next phase of AI adoption in Europe will be driven not only by model quality, but by trust, governance, and data ownership. Building systems where organizations retain full control over their knowledge assets is becoming a strategic requirement rather than a technical preference.

Thank you again for the insightful feedback and for highlighting the alignment with the Sovereign Systems Specification. It's encouraging to see these architectural principles gaining traction across the community. 🚀

Ken W. Alger (verified) • Jun 8

@[galian] You’ve articulated the exact strategic shift that inspired the specification. For too long, the industry treated data privacy as a prompt-engineering problem rather than an infrastructure custody problem.

Your point about SMEs, law firms, and healthcare providers is where the rubber meets the road. Historically, only massive enterprises could afford the infrastructure overhead of fully isolated on-prem systems. The fact that an SME or a local accounting firm can now deploy an Ollama-driven, GDPR-compliant RAG architecture on a single piece of hardware completely changes the game. It democratizes true data custody.

You are spot on regarding the European regulatory landscape. As the compliance burden shifts from theoretical risk to hard legal liability, architectures like yours—where data lineage never crosses a network boundary—will become the default starting point for any serious implementation.

Fantastic work building a blueprint that proves data sovereignty is an achievable engineering reality, not just a theoretical ideal. Looking forward to seeing how your pipeline evolves.
