Private, Offline RAG in Python with Ollama: A Self-Hosted, GDPR-Friendly Build


Most RAG tutorials quietly ship your documents to someone else's cloud. You paste a contract, a customer list, or an internal wiki into an embedding API, and that data now lives on a server you don't control, in a jurisdiction you didn't choose. For a side project, fine. For a company handling personal or confidential data under GDPR, that's a problem you have to answer for.

This guide builds the opposite: a private, offline RAG pipeline in Python with Ollama, where the language model, the embeddings, and the vector store all run on your own machine. Nothing leaves the box. By the end you'll have one small, framework-light script that answers questions over your local documents — fully self-hosted, with a clear path to running it on an EU VPS for data residency.

No LangChain, no managed vector database, no API keys. Just ollama, numpy, and about 80 lines of Python you can actually read.

Why run RAG locally? Privacy, cost, and data sovereignty

There are three honest reasons to keep RAG on-premises, and "it's cool" isn't one of them.

Privacy and data residency. The moment you send text to a hosted embedding or chat API, you've performed a data transfer to a third-party processor — and if their servers sit outside the EEA, you've also triggered the international-transfer rules of GDPR. Keeping everything local sidesteps that entire category of risk: there is no transfer because there is no third party. For regulated data (health, finance, HR, legal), "the data physically never left our infrastructure" is the cleanest compliance story you can tell.

Cost at scale. Per-token pricing is great until you're embedding a few million chunks and re-embedding them every time you tweak your chunking. Local inference turns a variable, usage-coupled bill into a fixed hardware cost. If you run a steady workload, the math flips in favor of self-hosting fast.

Control and longevity. No surprise deprecations, no rate limits mid-refactor, no model swapped out from under you. The weights you downloaded today will still run next year, offline, on a plane, in an air-gapped network.

Local LLMs stopped being a hobbyist curiosity somewhere in 2025. Ollama made pulling and running open-weight models a one-liner, and privacy is now consistently cited as a leading barrier to enterprise cloud-LLM adoption — which is exactly the gap local inference fills. If you want the structured, end-to-end version of this topic, there's a full course on local LLMs, privacy, and self-hosting (Romanian-language) that goes deeper than one tutorial can; I'll link a few relevant ones as we hit each topic.

What we're building

One Python script, rag.py, that:

  1. Reads .txt and .md files from a local ./docs folder.
  2. Splits them into overlapping chunks.
  3. Embeds each chunk with a local embedding model via Ollama.
  4. Stores the vectors in memory with NumPy (no external DB).
  5. On each question, embeds the query, finds the most similar chunks by cosine similarity, and feeds them to a local chat model as grounded context.

That's the whole RAG loop: retrieve, augment, generate — with every step running on your hardware. RAG itself (chunking, embeddings, retrieval quality) is a deep subject; if you want the theory behind the choices here, there's a dedicated RAG course that covers it properly.

Step 1 — Install Ollama and pull your models (and check the license)

Install Ollama from ollama.com, then pull one chat model and one embedding model:

# A capable, permissively licensed chat model (Apache 2.0)
ollama pull qwen3:8b

# A fast, high-quality embedding model
ollama pull nomic-embed-text

Verify the daemon is up:

ollama list
ollama run qwen3:8b "Say hello in one short sentence."
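
The same check can be scripted, which is handy before a long indexing run. This is a minimal sketch, assuming Ollama is listening on its default local address and using its /api/tags endpoint, which lists the models available locally. The helper names installed_models and missing_models are illustrative and not part of any library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default local address
REQUIRED = ["qwen3:8b", "nomic-embed-text"]


def installed_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Return the model names reported by GET /api/tags."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        payload = json.load(resp)
    return [m["name"] for m in payload.get("models", [])]


def missing_models(installed: list[str], required: list[str]) -> list[str]:
    """Names in `required` that do not appear in `installed`.

    A name without an explicit tag (e.g. "nomic-embed-text") is compared
    as "<name>:latest", which is how Ollama lists an untagged pull.
    """
    have = set(installed)
    missing = []
    for name in required:
        full = name if ":" in name else f"{name}:latest"
        if full not in have:
            missing.append(name)
    return missing
```

Calling missing_models(installed_models(), REQUIRED) should return an empty list once both models are pulled. If the server is not running, installed_models raises an OSError subclass (connection refused or timeout), which is a reasonable place to print a "start Ollama first" message.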

A word on licenses, because this is where people get burned. "Open-weight" does not mean "do whatever you want commercially." As of 2026:

  • Qwen3 ships under Apache 2.0 for its open variants — genuinely permissive, commercial use included.
  • Gemma 3 uses a custom Google license (the Gemma Terms of Use), not Apache 2.0, with its own use restrictions.
  • Llama 4 uses Meta's community license, which is custom and has conditions (notably around very large deployments and naming).

The takeaway: before you ship anything commercial, open the model's card and read its actual license. Don't trust a blog post (including this one) as your legal source — model terms change. We default to Qwen3 here precisely because Apache 2.0 keeps the legal surface small.

Install the Python dependencies:

pip install ollama numpy

Step 2 — The one-line migration: Ollama's OpenAI-compatible API

Here's the detail that makes adoption painless. Ollama exposes an OpenAI-compatible endpoint, so if you already have code written against the OpenAI SDK, you don't rewrite it — you repoint it:

from openai import OpenAI

# The ONLY changes from a normal OpenAI setup:
client = OpenAI(
    base_url="http://localhost:11434/v1",  # your local Ollama
    api_key="ollama",                       # required by the SDK, but ignored locally
)

resp = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Explain RAG in one sentence."}],
)
print(resp.choices[0].message.content)

Same chat.completions.create, same response shape — but it runs on localhost and never makes an outbound request. That's your migration path for an existing app: swap base_url and api_key, point model at a local model, done. Building production apps against these SDKs (and abstracting over multiple providers cleanly) is its own discipline — covered in the course on building AI apps with Python and the OpenAI/Anthropic SDKs.
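
If you want that swap to be reversible without touching code, one option is to build the client arguments from the environment. The sketch below is only an assumption about how you might structure it: client_settings and the RAG_BACKEND variable are hypothetical names, and the default deliberately points at the local endpoint.

```python
import os
from collections.abc import Mapping
from typing import Optional

LOCAL_BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint


def client_settings(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Build keyword arguments for OpenAI(...) from environment variables.

    RAG_BACKEND is a hypothetical variable: "local" (the default) targets
    Ollama; any other value uses the hosted API with OPENAI_API_KEY.
    """
    env = os.environ if env is None else env
    if env.get("RAG_BACKEND", "local") == "local":
        # The SDK insists on a key; Ollama ignores its value.
        return {"base_url": LOCAL_BASE_URL, "api_key": "ollama"}
    # Raises KeyError if the key is not set, which is intentional here.
    return {"api_key": env["OPENAI_API_KEY"]}


# Usage (requires `pip install openai`):
#   from openai import OpenAI
#   client = OpenAI(**client_settings())
```

Defaulting to local means a missing or misspelled variable cannot silently send documents to a hosted API.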

For the rest of this tutorial I'll use the native ollama Python library instead, because it's the most direct way to call a local model — but everything below works through the OpenAI-compatible client too.

Step 3 — Local embeddings and an in-memory vector store

Embeddings turn text into vectors so we can measure semantic similarity. With Ollama, that's one call to a local model — nomic-embed-text returns 768-dimensional vectors and never phones home.

import ollama
import numpy as np

EMBED_MODEL = "nomic-embed-text"

def embed_texts(texts: list[str]) -> np.ndarray:
    """Embed a batch of strings with a local Ollama model."""
    resp = ollama.embed(model=EMBED_MODEL, input=texts)
    return np.array(resp["embeddings"], dtype=np.float32)
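
For a large corpus, sending every chunk in one request can make that single call slow and memory-hungry. A hedged alternative is to embed in fixed-size batches. In this sketch, embed_fn stands in for any function that takes a list of strings and returns one vector per string, and the batch size of 64 is an arbitrary assumption, not an Ollama recommendation.

```python
from typing import Callable, Iterator


def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 64,  # assumption: tune for your hardware
) -> list[list[float]]:
    """Embed `texts` batch by batch, preserving the input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors
```

With the function above, embed_fn could be lambda batch: embed_texts(batch).tolist(), and the combined result can be wrapped in np.array(..., dtype=np.float32) before it goes into the index.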

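Now the documents. We load every .txt and .md file under ./docs, then split each into overlapping character chunks. Overlap matters: a fixed-size cut will often land in the middle of a sentence, and with overlap the text near each boundary appears in both neighbouring chunks, so a short sentence there usually survives intact in at least one of them.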

import os
import glob

DOCS_DIR = "docs"
CHUNK_SIZE = 800       # characters per chunk
CHUNK_OVERLAP = 100    # characters shared between neighbours

def load_documents(folder: str) -> list[tuple[str, str]]:
    paths = (
        glob.glob(os.path.join(folder, "**", "*.txt"), recursive=True)
        + glob.glob(os.path.join(folder, "**", "*.md"), recursive=True)
    )
    docs = []
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            docs.append((path, f.read()))
    return docs

def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    chunks, start = [], 0
    step = max(1, size - overlap)  # guard against overlap >= size
    while start < len(text):
        piece = text[start:start + size].strip()
        if piece:
            chunks.append(piece)
        if start + size >= len(text):
            break  # this chunk reached the end; another step would only repeat its tail
        start += step
    return chunks
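
To see what those two constants do, it helps to look at character ranges rather than text. The helper below is illustrative only (chunk_spans is not part of rag.py). It steps through the text by size minus overlap, as chunk_text does, on a hypothetical 2,000-character document.

```python
def chunk_spans(length: int, size: int = 800, overlap: int = 100) -> list[tuple[int, int]]:
    """Return (start, end) character offsets for a text of `length` characters."""
    step = max(1, size - overlap)
    spans, start = [], 0
    while start < length:
        end = min(start + size, length)
        spans.append((start, end))
        if end == length:
            break  # reached the end of the text
        start += step
    return spans


spans = chunk_spans(2000)
print(spans)  # [(0, 800), (700, 1500), (1400, 2000)]

# Each neighbouring pair shares exactly `overlap` characters.
shared = [prev_end - next_start for (_, prev_end), (next_start, _) in zip(spans, spans[1:])]
print(shared)  # [100, 100]
```

Characters 700 to 800 and 1400 to 1500 each appear in two chunks, which is the redundancy that protects text near a boundary.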

We build the index once at startup: every chunk plus its source path and its vector, all held in memory.

def build_index(folder: str):
    docs = load_documents(folder)
    if not docs:
        raise SystemExit(
            f"No .txt or .md files found in ./{folder} — add some documents first."
        )
    chunks, sources = [], []
    for path, text in docs:
        for ch in chunk_text(text):
            chunks.append(ch)
            sources.append(path)
    embeddings = embed_texts(chunks)  # shape: (n_chunks, 768)
    return chunks, sources, embeddings

No Pinecone, no Chroma server, no network. For a knowledge base up to tens of thousands of chunks, NumPy in memory is genuinely enough — and it keeps the "nothing leaves the machine" guarantee absolute.
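
The "genuinely enough" claim is easy to sanity-check with arithmetic: a float32 value is 4 bytes, so the matrix costs n_chunks x 768 x 4 bytes. This back-of-the-envelope sketch counts only the raw embedding matrix; the chunk text and Python object overhead come on top, and index_bytes is just an illustrative name.

```python
def index_bytes(n_chunks: int, dim: int = 768, bytes_per_value: int = 4) -> int:
    """Raw size of an (n_chunks, dim) float32 matrix, ignoring any overhead."""
    return n_chunks * dim * bytes_per_value


for n in (1_000, 50_000, 1_000_000):
    mib = index_bytes(n) / 2**20
    print(f"{n:>9,} chunks -> {mib:,.1f} MiB")
```

That works out to about 2.9 MiB for 1,000 chunks, 146.5 MiB for 50,000, and roughly 2.9 GiB for a million, which is around the point where a persistent or approximate index starts to look attractive.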

Step 4 — Retrieval and the RAG loop

Retrieval is cosine similarity between the query vector and every chunk vector. We normalize both sides, take the dot product, and grab the top matches. The small + 1e-10 avoids a division-by-zero if a vector is all zeros.

TOP_K = 4

def cosine_search(query: str, chunks, sources, embeddings, k: int = TOP_K):
    q = embed_texts([query])[0]
    q = q / (np.linalg.norm(q) + 1e-10)
    mat = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-10)
    scores = mat @ q                         # cosine similarity for every chunk
    top = np.argsort(scores)[::-1][:k]       # indices of the k best
    return [(chunks[i], sources[i], float(scores[i])) for i in top]
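
Here is the same arithmetic on toy data, so the scores can be checked by hand. The three 2-D vectors and the query are made up for illustration; real embeddings from nomic-embed-text have 768 dimensions.

```python
import numpy as np

# Three hypothetical 2-D "chunk" vectors and one query vector.
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
q = np.array([1.0, 0.0])

# Same steps as cosine_search: normalize both sides, then take the dot product.
q_unit = q / (np.linalg.norm(q) + 1e-10)
mat = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-10)
scores = mat @ q_unit

top = np.argsort(scores)[::-1][:2]
print(np.round(scores, 3).tolist())  # [1.0, 0.0, 0.707]
print(top.tolist())                  # [0, 2]
```

The first vector points the same way as the query (score 1), the second is orthogonal (score 0), and the third sits at 45 degrees (score 1/sqrt(2), about 0.707), so the top two are chunks 0 and 2.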

Then we assemble a grounded prompt and ask the local chat model. The instruction to answer only from the context — and to admit when the answer isn't there — is what keeps the model from hallucinating confidently over your data.

CHAT_MODEL = "qwen3:8b"

def answer(query: str, retrieved) -> str:
    context = "\n\n".join(f"[Source: {src}]\n{text}" for text, src, _ in retrieved)
    prompt = (
        "You are a precise assistant. Answer the question using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    resp = ollama.chat(
        model=CHAT_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["message"]["content"]
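
To make the prompt concrete, here is the same assembly pulled out into a pure function and run on two made-up hits. build_prompt is a hypothetical refactoring for illustration, and the file names, text, and scores are invented.

```python
def build_prompt(query: str, retrieved: list[tuple[str, str, float]]) -> str:
    """Assemble the grounded prompt from (text, source, score) tuples."""
    context = "\n\n".join(f"[Source: {src}]\n{text}" for text, src, _ in retrieved)
    return (
        "You are a precise assistant. Answer the question using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


hits = [
    ("Invoices are due within 30 days.", "docs/billing.md", 0.82),
    ("Refunds are processed within 14 days.", "docs/refunds.md", 0.61),
]
print(build_prompt("When are invoices due?", hits))
```

Printing the prompt like this is a cheap debugging step: if the right passage is not in the context block, no amount of prompt wording will fix the answer.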

Finally, a tiny REPL to tie it together:

if __name__ == "__main__":
    print("Indexing local documents...")
    chunks, sources, embeddings = build_index(DOCS_DIR)
    print(f"Indexed {len(chunks)} chunks. Ask a question (Ctrl+C to exit).")
    while True:
        try:
            question = input("\n> ").strip()
        except (EOFError, KeyboardInterrupt):
            print()
            break
        if not question:
            continue
        hits = cosine_search(question, chunks, sources, embeddings)
        print("\n" + answer(question, hits))
        print("\nSources: " + ", ".join(sorted({src for _, src, _ in hits})))

Drop a few .md or .txt files into ./docs, run python rag.py, and you have a working private assistant over your own documents. The full script is just the blocks above, in order, in one file.

Note on "thinking" models. Some reasoning models (Qwen3 among them) can emit internal <think> traces. If you see them in the output, either pick a non-reasoning model or pass the library's think=False option to ollama.chat. For a clean Q&A bot, plain instruction-following models are simplest.
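
If you would rather keep a reasoning model, a fallback is to strip the trace after generation. This sketch assumes the trace arrives wrapped in literal <think> and </think> tags inside the reply text; strip_think is an illustrative helper, not part of the ollama library.

```python
import re

# Matches a <think>...</think> block, including newlines inside it.
_THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Remove <think>...</think> traces from a model reply."""
    return _THINK_RE.sub("", text).strip()


raw = "<think>\nThe context mentions 30 days.\n</think>\n\nInvoices are due within 30 days."
print(strip_think(raw))  # Invoices are due within 30 days.
```

In answer(), that would mean returning strip_think(resp["message"]["content"]). Replies without a trace pass through unchanged.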

Step 5 — Self-hosting on an EU VPS: residency and a DPIA mindset

Running on your laptop proves the concept. For a team, you'll want it on a server you control — ideally one in your own jurisdiction. A GPU VPS hosted in the EU gives you the same offline pipeline with data residency baked in: the documents, the embeddings, and the model all stay on infrastructure inside the EEA.

A few practical, privacy-by-design habits when you productionize this:

  • Data minimisation. Only index what the assistant actually needs. Don't bulk-load an entire shared drive because you can.
  • Access control. Put the service behind authentication and network restrictions; "internal only" should mean it.
  • Retention and logs. Decide whether you log prompts at all. If you do, treat those logs as the sensitive data they are.
  • DPIA mindset. For higher-risk processing, a Data Protection Impact Assessment is how you reason through this before launch, not after an incident.
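
As one concrete reading of the retention point above, here is a sketch of a log line that records that a question was asked without storing its text. It is an assumption about what you might want, not a compliance recipe: a hash of a short, guessable question can still be reversed by brute force, and source paths can themselves be sensitive. log_record and its field names are hypothetical.

```python
import hashlib
import json
import time


def log_record(question: str, sources: list[str]) -> str:
    """Build a JSON log line that omits the raw question text."""
    digest = hashlib.sha256(question.encode("utf-8")).hexdigest()
    record = {
        "ts": int(time.time()),
        "question_sha256": digest,        # lets you count repeats without the text
        "question_chars": len(question),
        "sources": sorted(set(sources)),  # which files were retrieved
    }
    return json.dumps(record)
```

Whether even this much is appropriate is a question for your DPO; the point is only that "log everything" and "log nothing" are not the only two options.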

Not legal advice. This section is engineering guidance, not a legal opinion. For the actual obligations that apply to your processing, consult your DPO and the primary sources — the GDPR text on EUR-Lex and your national supervisory authority. Architecture can support compliance, but it doesn't replace a proper assessment.

Choosing your model: Qwen3 vs Gemma 3 vs Llama 4

Your two levers are capability and license. A rough 2026 map for the chat model:

Model   | License (verify the card)          | Notes
Qwen3   | Apache 2.0 (open variants)         | Strong all-rounder; the safe default for commercial use.
Gemma 3 | Custom Google license (not Apache) | Capable and efficient, but read the Gemma terms before shipping.
Llama 4 | Meta community license (custom)    | Solid ecosystem; conditions apply at large scale.

For embeddings, nomic-embed-text is a great default (768-dim, fast, good quality). If you need stronger multilingual retrieval, bge-m3 (1024-dim) is worth a look; mxbai-embed-large (also 1024-dim) is a larger option aimed mainly at English text. Whichever you choose, keep your embedding model consistent between indexing and querying, because vectors from different models aren't comparable.
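
A cheap way to enforce that consistency is to record the model name and vector size when you build the index and check both before every search. The sketch below assumes you keep that metadata somewhere alongside the vectors; check_index_compatible is a hypothetical helper.

```python
def check_index_compatible(index_model: str, index_dim: int,
                           query_model: str, query_dim: int) -> None:
    """Raise ValueError if a query vector cannot be compared with the index."""
    # Compare names first: two different models can share a dimension
    # and still produce vectors that live in unrelated spaces.
    if index_model != query_model:
        raise ValueError(
            f"index was built with {index_model!r} but the query uses {query_model!r}"
        )
    if index_dim != query_dim:
        raise ValueError(f"dimension mismatch: index {index_dim}, query {query_dim}")
```

The name check matters more than the dimension check: swapping between two 1024-dim models would pass a shape test and still return meaningless scores.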

The honest rule on size: bigger isn't automatically better for RAG. Because the answer is grounded in retrieved context, a well-tuned 7–8B model often matches a much larger one on factual Q&A, while running comfortably on modest hardware.

Hardware sizing and GGUF quantization

Ollama serves models in GGUF format, quantized to shrink memory use with minimal quality loss. The quant level you pick is the main RAM/VRAM lever:

  • q4_K_M: the balanced default. Roughly 4-bit; the weights of a 7–8B model come to roughly 4 to 5 GB and run on most modern machines. Start here.
  • q8_0: near-lossless, about twice the footprint of q4. Use it when you have the memory and want maximum fidelity.
  • fp16: full half-precision; largest, rarely necessary for local RAG.

Rules of thumb, not guarantees: a 7–8B model at q4_K_M is comfortable on a machine with ~8 GB of free RAM/VRAM; step up to a 13–14B model and you'll want noticeably more. Treat these as starting points and watch actual usage — your context length and concurrency matter too.
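
Those rules of thumb come from simple arithmetic: parameters times bits per weight, divided by 8. The sketch below uses approximate bits-per-weight figures that are assumptions for rough sizing, not exact values for any specific model file, and it counts the weights only. The KV cache for your context window and runtime overhead come on top.

```python
# Approximate bits per weight. These are assumptions for rough sizing,
# not exact figures for any particular model file.
BITS_PER_WEIGHT = {"q4_K_M": 4.8, "q8_0": 8.5, "fp16": 16.0}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimated size of the weights alone, in GB (10^9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


for quant in BITS_PER_WEIGHT:
    print(f"8B @ {quant:<7} ~ {weights_gb(8, quant):.1f} GB")
```

Under these assumptions an 8B model needs about 4.8 GB at q4_K_M, 8.5 GB at q8_0, and 16 GB at fp16, which is why ~8 GB of free memory is a comfortable floor for the q4 variant and not for the others.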

Limitations, and when local is the wrong call

Self-hosting isn't a free lunch. Be honest about the trade-offs:

  • Top-end quality. The very largest frontier models still lead on the hardest reasoning. For grounded document Q&A the gap is small; for open-ended reasoning it's real.
  • Ops burden. You now own uptime, updates, GPU drivers, and scaling. A managed API hands that to someone else.
  • Throughput. One GPU serves limited concurrency. High-traffic, bursty workloads can be cheaper and simpler on a metered API.

A reasonable heuristic: local for sensitive, steady, or offline workloads; hosted for spiky, public-data, or cutting-edge-reasoning needs. Many teams run both and route by data sensitivity. Cost-optimising LLM usage in production — caching, routing, batching across local and hosted — is a topic of its own, covered in the advanced LLM integration course.

One more thing before you trust a local model in production: measure it. Build a small evaluation set of question/expected-answer pairs and score retrieval and answer quality before and after any change (different model, chunk size, or quant). "It felt fine in the demo" is how regressions ship. If you want a rigorous approach to that, see the course on LLM evals for production.
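
A first evaluation does not need a framework. This sketch scores retrieval only, as the fraction of questions whose expected source file shows up in the top k results; answer quality needs its own scoring on top. The retrieve argument is any function that maps a question to a ranked list of source paths, and hit_rate_at_k is an illustrative name.

```python
from typing import Callable

# Each case pairs a question with the file expected to contain the answer.
EvalCase = tuple[str, str]


def hit_rate_at_k(
    cases: list[EvalCase],
    retrieve: Callable[[str], list[str]],
    k: int = 4,
) -> float:
    """Fraction of questions whose expected source is among the top-k retrieved."""
    if not cases:
        raise ValueError("evaluation set is empty")
    hits = sum(1 for question, expected in cases if expected in retrieve(question)[:k])
    return hits / len(cases)
```

With rag.py, retrieve could be lambda q: [src for _, src, _ in cosine_search(q, chunks, sources, embeddings)]. Run it before and after changing the chunk size or embedding model and compare the two numbers.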

Conclusion

You now have a complete, private RAG pipeline that runs entirely on your own hardware: local embeddings, an in-memory vector store, local generation, and a one-line path to migrate an existing OpenAI app to it. No documents leave the machine, no per-token bill, no third-party processor in the data path — which is exactly what makes it a defensible choice under GDPR.

The pattern scales further than this script: swap the in-memory store for a persistent local vector DB when you outgrow NumPy, put it behind auth on an EU VPS, and add an eval harness so you can change models with confidence. But the core stays the same — retrieve, augment, generate, all on infrastructure you control.

Start small: drop ten documents into ./docs, run it, and ask it something only your files know. The first correct, fully-offline answer is a good feeling. From there, the only question is how much of your private knowledge you want to make queryable — privately.

Founder of Cursuri-AI.ro and Co-Founder of ProtectAds.com. Passionate about scalable architectures, AI integration, and building premium software solutions.