How I Built a PII Tokenization Middleware to Keep Sensitive Data Out of LLM APIs

Originally published at dev.to

The Problem I Kept Ignoring

Every time we sent a customer transcript to an LLM API, we were sending real data — credit card numbers, home addresses, full names, national IDs — in plaintext to a third-party server.

Most teams I've talked to handle this in one of two ways:

  1. Ignore it and hope the provider's data processing agreement covers them
  2. Prompt engineer around it — "don't repeat personal information in your response" — which does nothing about what's already been transmitted

Neither is acceptable in a production system handling real user data. So I built llm-hasher — a PII tokenization middleware that sits between your application and any LLM API.


The Core Idea

The LLM doesn't need to see the actual credit card number to summarize a support transcript. It just needs to know a credit card number was mentioned. So instead of:

"Hi, my card is 4111-1111-1111-1111 and email is john@example.com"

The LLM receives:

"Hi, my card is CREDIT_CARD_john12_4f8a2b and email is EMAIL_john12_9c3d1a"

It can still reason about the context. It just never touches the real values. When the response comes back, you detokenize it and restore the originals.


Architecture

Your App ──► POST /v1/tokenize ──► llm-hasher ──► tokenized text
                                       │
                              detects PII locally
                              (Ollama, no cloud)
                              stores in encrypted vault
 
Your App ──► [your LLM call with tokenized text]
 
Your App ──► POST /v1/detokenize ──► llm-hasher ──► original text restored

Three moving parts: a detector, a vault, and an HTTP service wrapping both.


Detection: Hybrid Regex + LLM

PII falls into two categories that require different detection strategies.

Structured PII — credit cards, emails, IBANs, IPv4 addresses — has well-defined patterns. Regex handles these with sub-millisecond latency and 100% recall on valid formats. No need to involve a language model.
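
As a rough sketch, the structured pass can be little more than a table of compiled regexes applied in one scan. The patterns and the Entity shape below are illustrative, not llm-hasher's actual rule set:

// Illustrative only: simplified patterns and a minimal Entity shape.
type Entity struct {
    Type, Value string
    Start, End  int
}

var structuredPatterns = map[string]*regexp.Regexp{
    "CREDIT_CARD": regexp.MustCompile(`\b(?:\d[ -]?){13,16}\b`),
    "EMAIL":       regexp.MustCompile(`\b[\w.+-]+@[\w-]+\.[A-Za-z]{2,}\b`),
    "IPV4":        regexp.MustCompile(`\b(?:\d{1,3}\.){3}\d{1,3}\b`),
}

func detectStructured(text string) []Entity {
    var found []Entity
    for typ, re := range structuredPatterns {
        for _, loc := range re.FindAllStringIndex(text, -1) {
            found = append(found, Entity{Type: typ, Value: text[loc[0]:loc[1]], Start: loc[0], End: loc[1]})
        }
    }
    return found
}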

Contextual PII — names, addresses, national IDs, passports — is where regex breaks down completely. "John Smith" looks identical to "Smith & Wesson" to a pattern matcher. You need semantic understanding.

For contextual PII, llm-hasher sends the text to a locally running Ollama instance. The model (default: llama3.2:3b) extracts entities and returns structured JSON. Because Ollama runs on your own server, this detection step never touches an external API — your raw data stays on your infra.

The hybrid approach gives you the best of both: speed and precision for structured types, semantic understanding for contextual ones.
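
In sketch form, the contextual pass is one request to Ollama's /api/generate with JSON output requested. The prompt, response handling, and error paths here are simplified guesses at the shape, not llm-hasher's actual detector code:

// Sketch only: the real prompt and parsing differ.
// Uses bytes, encoding/json and net/http.
func detectContextual(ctx context.Context, text string) ([]Entity, error) {
    body, _ := json.Marshal(map[string]any{
        "model":  "llama3.2:3b",
        "prompt": "List PERSON, ADDRESS, NATIONAL_ID and PASSPORT entities in this text as a JSON array: " + text,
        "format": "json", // ask Ollama to return strict JSON
        "stream": false,
    })
    req, err := http.NewRequestWithContext(ctx, http.MethodPost,
        "http://localhost:11434/api/generate", bytes.NewReader(body))
    if err != nil {
        return nil, err
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    var out struct {
        Response string `json:"response"` // Ollama wraps the model's output here
    }
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        return nil, err
    }

    var entities []Entity
    err = json.Unmarshal([]byte(out.Response), &entities)
    return entities, err
}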

Chunking for Long Texts

Sending a 5,000-word transcript to Ollama in one shot causes problems — context window limits, degraded accuracy on long inputs, serial latency.

llm-hasher chunks large texts (configurable, default 800 words) and processes chunks in parallel goroutines:

// Simplified — actual implementation handles overlap and deduplication
func (d *Detector) detectParallel(ctx context.Context, text string) ([]Entity, error) {
    chunks := chunk(text, d.cfg.ChunkSize)
    results := make(chan []Entity, len(chunks)) // buffered so goroutines never block on send

    var wg sync.WaitGroup
    for _, c := range chunks {
        wg.Add(1)
        go func(c string) {
            defer wg.Done()
            entities, _ := d.detectWithOllama(ctx, c) // per-chunk errors dropped in this simplified version
            results <- entities
        }(c)
    }

    wg.Wait()
    close(results)
    return merge(results), nil
}
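
The chunk and merge helpers aren't shown above; a minimal word-count version (without the overlap handling the comment mentions) might look like this:

// Minimal sketch: fixed-size word chunks, no overlap, unlike the real implementation.
func chunk(text string, size int) []string {
    words := strings.Fields(text)
    var chunks []string
    for start := 0; start < len(words); start += size {
        end := start + size
        if end > len(words) {
            end = len(words)
        }
        chunks = append(chunks, strings.Join(words[start:end], " "))
    }
    return chunks
}

// merge drains the (already closed) results channel into one slice.
func merge(results chan []Entity) []Entity {
    var all []Entity
    for entities := range results {
        all = append(all, entities...)
    }
    return all
}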

A 5,000-word document with 6 chunks processes in roughly the same time as a single chunk — latency scales with the slowest chunk, not the total count.


The Vault: AES-256-GCM Encrypted SQLite

Token-to-value mappings are stored in a local SQLite database. Each value is encrypted with AES-256-GCM before being written.
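
The encryption step itself is standard-library territory. Here is a minimal sketch of encrypt-before-write, with key loading and the actual SQLite insert left out (the real vault's code will differ in the details):

// Sketch of AES-256-GCM encryption for a single value.
// Uses crypto/aes, crypto/cipher and crypto/rand.
func encrypt(key [32]byte, plaintext []byte) ([]byte, error) {
    block, err := aes.NewCipher(key[:])
    if err != nil {
        return nil, err
    }
    gcm, err := cipher.NewGCM(block)
    if err != nil {
        return nil, err
    }
    nonce := make([]byte, gcm.NonceSize())
    if _, err := rand.Read(nonce); err != nil {
        return nil, err
    }
    // Prepend the nonce so decryption can recover it from the stored blob.
    return gcm.Seal(nonce, nonce, plaintext, nil), nil
}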

Key design decisions:

Context scoping with your own IDs. Instead of generating opaque foreign UUIDs that you'd need to track on your side, you pass a context_id from your domain:

{
  "text": "Hi, my card is 4111-1111-1111-1111",
  "context_id": "zoom_call_789"
}

This means your Zoom call processor can detokenize with zoom_call_789 without needing to store a mapping between your ID and a vault-generated UUID.

Deduplication within a context. The same PII value within a context always maps to the same token. If a name appears five times in a transcript, the LLM sees the same token each time — so it can reason about the entity consistently across the full text.
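
I won't claim this is the exact scheme, but deduplication falls out naturally if the token is derived deterministically from the context and the value, for example with a truncated HMAC:

// Illustrative token derivation, not necessarily llm-hasher's actual format.
// The middle segment mirrors the example tokens earlier; I'm treating it as a context tag.
// The same (contextID, value) pair always produces the same token, which gives
// per-context deduplication for free; a different context yields a different token.
func makeToken(entityType, contextID, value string, key []byte) string {
    mac := hmac.New(sha256.New, key)
    mac.Write([]byte(contextID + "\x00" + value))
    digest := hex.EncodeToString(mac.Sum(nil))
    return fmt.Sprintf("%s_%s_%s", entityType, contextID, digest[:6])
}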

TTL support. Tokens can have an expiry:

{
  "text": "...",
  "context_id": "session_abc",
  "ttl": "24h"
}

For compliance scenarios (GDPR right to erasure), there's a hard-delete endpoint:

DELETE /v1/contexts/{context_id}

This removes all mappings for that context from the vault. Once deleted, detokenization is impossible — by design.
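
From Go that's just an ordinary request with the DELETE method (net/http has no Delete shortcut), using the context ID from the earlier examples:

// Hard-delete every mapping for a context, e.g. on a GDPR erasure request.
req, err := http.NewRequest(http.MethodDelete,
    "http://localhost:8080/v1/contexts/zoom_call_789", nil)
if err != nil {
    panic(err)
}
resp, err := http.DefaultClient.Do(req)
if err != nil {
    panic(err)
}
resp.Body.Close()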


Detokenization: Single-Pass Multi-String Replace

Naive detokenization would loop through each token and do a string replace — O(n×m) where n is text length and m is token count. For a transcript with 40 entities, that's 40 passes over the text.

llm-hasher instead builds a single multi-pattern replacer over the whole token set and substitutes everything in one linear pass over the text:

func (v *Vault) Detokenize(text string, mappings map[string]string) string {
    replacer := strings.NewReplacer(flatten(mappings)...)
    return replacer.Replace(text)
}
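
flatten isn't shown either; all it has to do is turn the map into the alternating old/new slice that strings.NewReplacer expects, assuming the map goes token → original value:

// flatten converts {token: original} into [token1, original1, token2, original2, ...],
// the pair form strings.NewReplacer takes.
func flatten(mappings map[string]string) []string {
    pairs := make([]string, 0, len(mappings)*2)
    for token, original := range mappings {
        pairs = append(pairs, token, original)
    }
    return pairs
}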

Detokenization latency is effectively constant regardless of token count — typically under 5ms even for large documents.


Real-World Integration

Python — LLM Proxy Pattern

import requests
import openai
 
# 1. Tokenize before sending to LLM
resp = requests.post("http://localhost:8080/v1/tokenize", json={
    "text": transcript,
    "context_id": f"zoom_{call_id}"
})
tokenized = resp.json()
 
# 2. Send tokenized text to your LLM
llm_response = openai.chat.completions.create(
    model="gpt-4o-mini",  # any chat-capable model; use whatever you already run
    messages=[
        {"role": "system", "content": "Summarize this call transcript."},
        {"role": "user",   "content": tokenized["tokenized_text"]}
    ]
)
 
# 3. Detokenize the LLM response
final = requests.post("http://localhost:8080/v1/detokenize", json={
    "text": llm_response.choices[0].message.content,
    "context_id": f"zoom_{call_id}"
})
print(final.json()["original_text"])

Go — Library Mode

If you don't want to run a separate HTTP service, import the hasher package directly:

import "github.com/yemrealtanay/llm-hasher/pkg/hasher"
 
h, err := hasher.New(
    hasher.WithOllama("http://localhost:11434", "llama3.2:3b"),
    hasher.WithVault("data/vault.db", ""),
)
defer h.Close()
 
result, err := h.Tokenize(ctx, transcript, "zoom_call_789", nil)
// result.Text contains tokenized transcript
 
original, err := h.Detokenize(ctx, llmResponse, "zoom_call_789")

Performance Characteristics

Scenario                                      Typical latency
Short text, regex PII only                    < 5ms
Short text with LLM detection                 2–8s (model dependent)
Long text (5,000 words), 6 parallel chunks    3–10s
Detokenize (any size)                         < 5ms

The dominant cost is Ollama inference. On a modern laptop with llama3.2:3b, expect 2–4 seconds per chunk. A GPU or a larger/faster model changes this significantly. If your use case is async (batch processing, background jobs), the latency is generally acceptable without hardware changes.

For latency-sensitive paths, run tokenization asynchronously before the user-facing LLM call — most pipelines have a natural point to do this.


What It Doesn't Do (Yet)

It's not a firewall. If someone deliberately encodes PII to evade detection (e.g., spelling out digits), llm-hasher won't catch it. It handles the common case, not adversarial inputs.

Ollama recall isn't 100%. The LLM detector misses things, especially in noisy or multilingual text. Tuning confidence_threshold and chunk size helps, but there's no guarantee of perfect recall without human review.

No streaming support yet. Tokenization requires the full text — SSE/streaming tokenization is on the v2 roadmap.


Running It

git clone https://github.com/yemrealtanay/llm-hasher
cd llm-hasher
make docker-up

Docker Compose brings up Ollama, pulls llama3.2:3b (~2GB), and starts the service on port 8080. Check that it's running:

curl http://localhost:8080/healthz
# {"status":"ok"}

For production, set an explicit vault encryption key:

# Generate
openssl rand -hex 32
 
# Set in .env
VAULT_KEY=<your_64_char_hex_key>

If VAULT_KEY is not set, a key is auto-generated and saved to data/vault.key. That's fine for development, but not for production: the key has to survive restarts and redeploys, or the vault's encrypted mappings become unrecoverable.


What's Next

The v2 roadmap includes built-in LLM proxy endpoints (OpenAI-compatible and Anthropic), so instead of calling llm-hasher then your LLM separately, you point your existing OpenAI client at llm-hasher and it handles tokenization transparently in the middle. This would make adoption essentially zero-config for teams already using the OpenAI SDK.

Contributions are welcome, especially for v2 LLM provider adapters — each provider is a well-defined, self-contained implementation.

