I built a PDF parser that actually preserves table structure for RAG: here’s why it matters

It's a client demo. They're watching. I type:

"Which region had the highest revenue growth last quarter?"

My RAG app — three weeks of work, carefully tuned embeddings, clever prompts — responds instantly.
The client nods. Writes it down.
The answer was wrong. By almost double.

I spent three days debugging the wrong things.
Chunk size? Tried 256, 512, 1024. Nothing.
Temperature? 0.0, 0.3, 0.7. Still wrong.
Embedding model? Swapped in three of them. Nope.
Prompt engineering? Added "think step by step", "be precise", "do not hallucinate".
The LLM wasn't hallucinating. It was doing its best with this:
"45.2% Q3 Europe 38.1% Q2 Europe 41.7% Q3 Asia 29.3%"
Orphaned numbers. No column headers. No caption. No context.
The original table had all of that. My chunker ate it silently and handed the LLM garbage.

⚠️ The bug was never in retrieval. It was in ingestion. And I never thought to look there.

The dirty secret of RAG tutorials
Every tutorial shows you this pipeline:
PDF → extract text → chunk at 512 tokens → embed → store → retrieve → answer
Clean. Simple. Completely wrong for structured documents.
Here's what blind chunking does to real content:
| Document | What you had | What the LLM gets |
| --- | --- | --- |
| Financial report | Revenue table with headers | Orphaned numbers, zero context |
| Legal contract | 3-page clause | Split mid-sentence, both halves useless |
| API docs | Function + code example | Code separated from its description |
| Research paper | Figure with caption | Caption on chunk 7, analysis on chunk 12 |
The LLM is not the problem.

You're feeding it garbage and expecting gold.
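To make the failure concrete, here is a minimal sketch of what fixed-size chunking does to a small table. It is an illustration, not the code from my original pipeline: the caption is invented, the values echo the garbled chunk above (my reading of which number belongs to which quarter), and word windows stand in for real token windows.

```python
# Hypothetical table: the caption is invented for this demo.
CAPTION = "Revenue growth by region"
HEADERS = ["Region", "Q2", "Q3"]
ROWS = [["Europe", "38.1%", "45.2%"], ["Asia", "29.3%", "41.7%"]]


def flatten(caption, headers, rows):
    """Mimic naive text extraction: every cell ends up in one flat string."""
    cells = [caption, *headers]
    for row in rows:
        cells.extend(row)
    return " ".join(cells)


def chunk_words(text, size):
    """Fixed-size windows over whitespace-separated words, blind to structure."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


chunks = chunk_words(flatten(CAPTION, HEADERS, ROWS), size=7)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
# 0 'Revenue growth by region Region Q2 Q3'
# 1 'Europe 38.1% 45.2% Asia 29.3% 41.7%'
```

Chunk 1 is the one a retriever would match on the numbers, and it no longer says which column is Q2 and which is Q3.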

So I built the thing I wished existed
Meet DocNest — not another chunker.
A document normalization engine that reads structure before touching content.

Every heading → a navigable §section with its own ID
Every table → preserved as { caption, headers, rows[] } JSON
Every section → one-sentence LLM summary + BM25 keyword index
All of it → packed into a portable .udf file
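As a rough sketch of why the preserved table shape matters: once caption, headers, and rows survive ingestion, a growth question becomes a plain lookup with no LLM involved. Only the { caption, headers, rows[] } shape comes from the list above. The `Table` class, the helper names, the caption, and the assignment of values to quarters are assumptions for illustration, not DocNest's actual schema or code.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Table:
    """Hypothetical container mirroring the { caption, headers, rows[] } shape."""
    caption: str
    headers: list[str]
    rows: list[list[str]]


def column(table, name):
    """Return every cell in the column whose header matches `name`."""
    idx = table.headers.index(name)
    return [row[idx] for row in table.rows]


def pct(cell):
    """Parse a cell such as '41.7%' into a float."""
    return float(cell.rstrip("%"))


def biggest_gain(table, key_col, start_col, end_col):
    """Find the row with the largest increase, in percentage points."""
    keys = column(table, key_col)
    starts = column(table, start_col)
    ends = column(table, end_col)
    gains = [round(pct(e) - pct(s), 1) for s, e in zip(starts, ends)]
    best = max(range(len(keys)), key=gains.__getitem__)
    return keys[best], gains[best]


table = Table(
    caption="Revenue growth by region",  # invented caption
    headers=["Region", "Q2", "Q3"],
    rows=[["Europe", "38.1%", "45.2%"], ["Asia", "29.3%", "41.7%"]],
)

print(json.dumps(asdict(table)))                    # the serialized structure
print(biggest_gain(table, "Region", "Q2", "Q3"))    # ('Asia', 12.4)
```

The headers travel with the numbers, so the answer no longer depends on where a chunk boundary happened to fall.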

python

from docnest.pipeline import DocNestPipeline
from docnest.reader import UDFIndex

# Convert — runs once, costs a few LLM calls
pipeline = DocNestPipeline(
    llm_provider="groq",           # free tier works perfectly
    llm_api_key="gsk_...",
    emb_provider="huggingface",    # local, no API key needed
)
pipeline.convert("report.pdf")    # → report.udf ✓

# Query
idx = UDFIndex.load("report.udf")
result = idx.query("Which region had the highest Q3 growth?")

print(result.answer)       # "Asia grew the most, up +12.4pp"
print(result.layer_used)   # 1
print(result.tokens_used)  # 0  ← yes, really. zero.

✅ Zero tokens. Correct answer. 18ms.
That's not a cherry-picked example. Here's why it's possible.

⚡ The 5-layer query engine
Instead of dumping the full document into an LLM every time, queries escalate through layers — stopping the moment one can answer confidently.
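| Layer | What it does | Tokens | Speed |
| --- | --- | --- | --- |
| 0 | Pre-computed summary + key numbers | 0 | < 1ms |
| 1 | BM25 + cosine → lands on exact §section | 0 | < 20ms |
| 2 | Section-scoped LLM call | ~300 | 1–3s |
| 3 | Multi-section synthesis | ~900 | 2–5s |
| 4 | Full document fallback | ~4000+ | 5–15s |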
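The escalation idea can be sketched in a few lines: try the cheapest layer first and stop as soon as one returns a confident answer. Everything here is hypothetical. The `LayerResult` type, the stub layers, the 0.8 threshold, and the way confidence is represented are assumptions for illustration, not how DocNest measures confidence internally.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LayerResult:
    answer: str
    confidence: float  # 0.0 to 1.0, however a layer chooses to estimate it
    tokens_used: int


Layer = Callable[[str], Optional[LayerResult]]


def escalate(question, layers, threshold=0.8):
    """Try layers in order; return (layer_number, result) for the first confident one."""
    last = None
    for number, layer in enumerate(layers):
        result = layer(question)
        if result is None:
            continue  # this layer cannot answer at all, move on
        last = (number, result)
        if result.confidence >= threshold:
            return last
    if last is None:
        raise LookupError("no layer produced an answer")
    return last  # nothing was confident, so return the last attempt


def summary_layer(question):
    # Layer 0 stand-in: only answers requests for an overview.
    if "summar" in question.lower():
        return LayerResult("Precomputed document summary.", 0.9, 0)
    return None


FACTS = {"q3 growth": "Asia grew the most, up 12.4pp"}  # toy index


def index_layer(question):
    # Layer 1 stand-in: keyword hit against a tiny prebuilt index.
    for key, answer in FACTS.items():
        if key in question.lower():
            return LayerResult(answer, 0.95, 0)
    return None


def llm_layer(question):
    # Layer 2 stand-in: a real implementation would call an LLM here.
    return LayerResult("(LLM answer placeholder)", 1.0, 300)


LAYERS = [summary_layer, index_layer, llm_layer]
print(escalate("Which region had the highest Q3 growth?", LAYERS))
print(escalate("What are the key financial risks?", LAYERS))
```

The first question stops at layer 1 with zero tokens. The second falls through to the LLM stub, which is the only point where tokens are spent.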
I expected layers 2–4 to do most of the work.

Layers 0 and 1 handle roughly 70% of real-world questions.

Think about that. Seven out of ten questions your users ask — the factual ones, the number lookups, the "what does section 3 say about X" queries — answered from a structured index at zero token cost.
You pay for LLM compute only when the question genuinely requires reasoning.
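For a feel of how a keyword index can land on the right section without an LLM, here is a small pure-Python BM25 ranker. It is a sketch of the BM25 half of layer 1 only: the cosine-similarity half is left out, the section texts are invented, and `k1=1.5`, `b=0.75` are common defaults rather than values taken from DocNest.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, sections, k1=1.5, b=0.75):
    """Rank section IDs by BM25 score against the query, best first."""
    docs = {sid: tokenize(text) for sid, text in sections.items()}
    n = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n
    # Document frequency: in how many sections does each term appear?
    df = Counter(term for tokens in docs.values() for term in set(tokens))

    scores = {}
    for sid, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[sid] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented section texts keyed by hypothetical section IDs.
SECTIONS = {
    "§1": "Overview of the company and its markets",
    "§2": "Revenue growth by region for Q2 and Q3",
    "§3": "Key financial risks and mitigation",
}

ranking = bm25_rank("Q3 revenue growth by region", SECTIONS)
print(ranking[0][0])  # §2
```

Scoring like this is pure arithmetic over a prebuilt index, which is why a lookup at this layer costs no tokens.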

Real numbers. Not vibes.
25 questions. 500-page open-source nutrition textbook. PyMuPDF + Groq free tier.
| Question type | Score |
| --- | --- |
| Basic facts (calories, macros) | ✅ 5/5 |
| Detailed nutrition (fiber, glycemic index) | ✅ 5/5 |
| Micronutrients (vitamins, minerals) | ✅ 4/5 |
| Hard synthesis (BMR, omega-3, antioxidants) | ✅ 5/5 |
| Edge cases + hallucination traps | ✅ 5/5 |
| Total | 24/25 (96%) |
The one failure: a table-only page where the text parser extracted nothing.
Fix: use DoclingPDFParser for image-heavy or scanned PDFs.
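One way to catch this class of failure early is to flag pages where text extraction returned almost nothing, then route only those pages to the heavier parser. This is a hedged sketch, not part of DocNest: the function name, the 20-character threshold, and the sample strings are all made up for illustration.

```python
def pages_needing_fallback(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is suspiciously short."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


# Invented extraction output: pages 2 and 3 came back empty or whitespace-only.
extracted = [
    "Chapter 4: Vitamins and minerals, an overview of daily needs.",
    "",
    "   \n",
    "Table 4.2 lists recommended daily intake values.",
]

print(pages_needing_fallback(extracted))  # [2, 3]
```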

Handles 600-page PDFs without exploding your RAM
Standard Docling loads the full document into memory.
600 pages on a normal laptop = out of memory.
DocNest splits the PDF into page batches automatically, processes each batch at full ML quality, and merges the output. Peak RAM stays constant regardless of document size.
python

from docnest.parsers.pdf import DoclingPDFParser

# Just works — auto-detects large PDFs
raw = DoclingPDFParser().parse("600-page-annual-report.pdf")

# Or tune explicitly
raw = DoclingPDFParser(chunk_pages=10).parse("report.pdf")  # low RAM
raw = DoclingPDFParser(chunk_pages=50).parse("report.pdf")  # speed mode
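The batching idea behind `chunk_pages` can be sketched independently of any parser. The names `page_batches` and `parse_in_batches` are hypothetical, the `parse_batch` callable is a stand-in for the real ML parser, and this is not DocNest's implementation. It only shows why the heavy per-batch state stays bounded by the batch size.

```python
from typing import Callable, Iterator


def page_batches(total_pages: int, chunk_pages: int) -> Iterator[range]:
    """Yield consecutive 0-based page ranges of at most `chunk_pages` pages."""
    if chunk_pages < 1:
        raise ValueError("chunk_pages must be at least 1")
    for start in range(0, total_pages, chunk_pages):
        yield range(start, min(start + chunk_pages, total_pages))


def parse_in_batches(
    total_pages: int,
    chunk_pages: int,
    parse_batch: Callable[[range], list],
) -> list:
    """Parse one batch at a time and merge the lightweight results in page order."""
    merged = []
    for batch in page_batches(total_pages, chunk_pages):
        # Only this batch's parser state is alive during the call.
        merged.extend(parse_batch(batch))
    return merged


def fake_parse(pages: range) -> list:
    """Stand-in for an expensive parser: returns one label per page."""
    return [f"page {p + 1}" for p in pages]


print(list(page_batches(25, 10)))
# [range(0, 10), range(10, 20), range(20, 25)]
print(len(parse_in_batches(600, 50, fake_parse)))  # 600
```

Smaller batches lower the peak memory of the parsing step at the cost of more batch overhead, which matches the low-RAM versus speed trade-off in the two settings above. The merged output still grows with the document, so the bound applies to the parser's working memory, not the final result.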

Try it

bash

pip install docnest-ai

Formats: PDF (ML + fast) · DOCX · XLSX · HTML · Markdown
LLM providers: Groq (free) · OpenAI · Ollama (local) · Anthropic · Mistral · Google · Cohere
Vector backends: numpy (zero deps) · FAISS · ChromaDB
bash

# CLI, because boilerplate is boring
docnest convert report.pdf --llm-provider groq --llm-model llama-3.3-70b-versatile
docnest query report.udf "What are the key financial risks?"
docnest view report.udf # structured HTML viewer in your browser
GitHub · PyPI · Format spec

What's still rough — honesty tax
This is 0.4.0a2 — alpha. It works on real documents, but:

PPTX parser isn't built yet
Qdrant and Weaviate backends are on the roadmap
SharePoint / Confluence connectors are planned

If any of those sound like something you want to build, the good first issues are labeled. The contributing guide is ready.

The question I keep thinking about
Most RAG infrastructure is built on the assumption that text extraction is a solved problem.
It isn't. Not for tables. Not for anything where position and relationship carry meaning.

What document type has caused you the most RAG pain?

For me it was financial tables. I've heard legal contracts and API docs are just as brutal.
Drop it in the comments. If it's a format DocNest doesn't handle yet — that's probably the next parser I build.

Building this in the open at github.com/tailorgunjan93/docnest. Stars, issues, and brutal feedback all welcome.
