Introduction
Most RAG tutorials show you how to build a chatbot that answers questions from a PDF.
Very few talk about why certain architectural decisions matter — and what happens to performance, accuracy, and user trust when you get them wrong.
I built a RAG pipeline that answers questions across 100-page documents in under 5 seconds with source-page citations. Here's every decision I made, why I made it, and what I'd do differently.
Full code: github.com/karankavyanjali77-sys/ai-pdf-rag-chatbot
The Problem With Most RAG Implementations
The default approach most tutorials show:
Load PDF
Split into chunks
Embed everything
Retrieve top-k chunks
Send to LLM
This works. But it has three problems that matter in production:
Problem 1 — Chunk size is arbitrary.
Most tutorials use 1000 characters per chunk with 200 character overlap. Nobody explains why. The chunk size directly determines whether your retrieved context is useful or noisy.
Problem 2 — Top-k retrieval optimises for recall, not precision.
Retrieving the top 5 chunks sounds safe. But if 3 of those chunks are tangentially related, you're sending noise to the LLM. Noise produces hallucinations.
Problem 3 — No source attribution.
Users don't trust answers they can't verify. Without source-page citations, a RAG system is just a black box.
My architecture addresses all three.
The Stack
LangChain — orchestration layer
FAISS — vector store (local, fast, no API cost)
Sentence-Transformers — embedding model (all-MiniLM-L6-v2)
Groq LLM — inference (llama3-8b-8192)
PyPDF — PDF parsing
Streamlit — UI layer
Why Groq instead of OpenAI? Speed. Groq's inference on llama3 is significantly faster than GPT-3.5 for this use case and costs less at scale. For a system where sub-5-second response time is the goal — inference speed matters.
Why FAISS instead of Pinecone or ChromaDB? For a single-user document Q&A system, a local vector store has zero latency overhead. No API calls. No network round trips. Pure in-memory search.
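For concreteness, here is a minimal sketch of how these pieces wire together. Package paths and the ChatGroq parameters assume recent langchain-community and langchain-groq releases, and the example document is a stand-in for the real chunks produced below.

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_groq import ChatGroq

# Embeddings run locally, so indexing and search cost no API calls.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# In the real pipeline these documents come from the chunking step below.
docs = [Document(page_content="Example chunk.", metadata={"page_number": 1})]
vectorstore = FAISS.from_documents(docs, embeddings)

# Groq-hosted llama3; requires GROQ_API_KEY in the environment.
llm = ChatGroq(model="llama3-8b-8192", temperature=0)
```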
The Architecture Decision That Matters Most — Chunking Strategy
This is where most RAG implementations go wrong.
I tested three chunking strategies:
Strategy 1 — Fixed size (1000 chars, 200 overlap)
Simple. Fast. But splits mid-sentence frequently. Retrieval quality suffers because semantic meaning gets cut.
Strategy 2 — Recursive character splitting
LangChain's RecursiveCharacterTextSplitter tries to split on paragraphs first, then sentences, then words. Better semantic coherence than fixed size.
Strategy 3 — Smaller chunks with higher overlap (500 chars, 100 overlap)
Counter-intuitive but this is what I landed on. Here's why:
Smaller chunks = more precise retrieval. When a user asks a specific question, a 500-character chunk containing exactly the relevant sentence retrieves better than a 1000-character chunk that contains the relevant sentence plus 500 characters of surrounding noise.
Higher overlap = context preservation. The 100-character overlap ensures that sentences split across chunk boundaries still appear in at least one complete chunk.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # small chunks for precise retrieval
    chunk_overlap=100,   # overlap preserves sentences cut at boundaries
    length_function=len,
    separators=["\n\n", "\n", ".", "!", "?", ",", " "]
)
```
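Applying the splitter is a single call (`pages` here is the list of page documents produced by the PDF loader, shown in the citations section below):

```python
# Each chunk inherits the metadata of the page it came from.
chunks = text_splitter.split_documents(pages)
```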
Result: Retrieval precision improved measurably. Answers became more specific and less likely to include irrelevant surrounding context.
Precision Over Recall — The Retrieval Decision
Most RAG implementations retrieve the top 5 or top 10 chunks, on the theory that more context means a safer answer.
I chose top-3.
Here's why: LLMs hallucinate more when given noisy context than when given limited but precise context. Sending 5 chunks where 2 are irrelevant is worse than sending 3 chunks where all 3 are relevant.
The tradeoff: occasionally the answer requires information from a 4th chunk that gets missed. I mitigated this by using a similarity threshold — chunks below 0.7 cosine similarity are excluded regardless of k.
```python
retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        "k": 3,                  # precision over recall
        "score_threshold": 0.7   # drop weak matches regardless of k
    }
)
```
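Querying it is then a single call; chunks below the threshold never reach the LLM (the question string is illustrative):

```python
relevant_chunks = retriever.get_relevant_documents("What is the notice period?")
```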
Source-Page Citations — Building User Trust
Every answer surfaces the source page number from the original PDF.
This required storing page metadata during the chunking phase:
```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(pdf_path)
pages = loader.load_and_split()
for i, page in enumerate(pages):
    page.metadata["page_number"] = i + 1  # 1-indexed, matching the PDF
```
Then surfacing it in the response:
```python
# Deduplicate the page numbers of the chunks that were actually retrieved.
source_pages = list(set([
    doc.metadata.get("page_number", "Unknown")
    for doc in retrieved_docs
]))
response = f"{answer}\n\nSources: Pages {source_pages}"
```
Why this matters: a user asking a question about a legal document or technical specification needs to verify the answer. "The answer is X (Source: Page Y)" is trusted. "The answer is X" is not.
The Two-Module Architecture
I separated the system into two independent modules:
app.py — Streamlit UI layer
rag_engine.py — Retrieval and generation logic
Why does this matter? The RAG engine can be called from any interface — a REST API, a Slack bot, a CLI tool — without touching the UI code. This is production-style architecture. Not a notebook demo.
rag_engine.py
```python
class RAGEngine:
    def __init__(self, pdf_path):
        # Build the FAISS index and the QA chain once, at startup.
        self.vectorstore = self._build_vectorstore(pdf_path)
        self.chain = self._build_chain()

    def answer(self, question):
        # The chain returns the generated answer plus the chunks it used.
        result = self.chain({"query": question})
        return result["result"], result["source_documents"]
```
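As a sketch of what that separation buys, the same engine drops straight into a CLI with no Streamlit anywhere (the cli.py file is hypothetical, not part of the repo):

```python
# cli.py -- usage: python cli.py document.pdf "What is the notice period?"
import sys

from rag_engine import RAGEngine

engine = RAGEngine(pdf_path=sys.argv[1])
answer, sources = engine.answer(sys.argv[2])
print(answer)
print("Sources:", [doc.metadata.get("page_number") for doc in sources])
```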
Performance — How We Got Under 5 Seconds
End-to-end response time breakdown:
| Step | Time |
| --- | --- |
| Query embedding | ~50ms |
| FAISS similarity search | ~10ms |
| Groq LLM inference | ~3–4 seconds |
| Total | ~4–4.5 seconds |
The bottleneck is LLM inference — not retrieval. FAISS is extremely fast for in-memory search. Groq's inference speed on llama3 is what makes sub-5-second response time achievable.
For comparison: the same pipeline with GPT-3.5-turbo averaged 8–12 seconds end-to-end. Groq is the reason we hit the performance target.
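If you want to reproduce this breakdown on your own documents, crude wall-clock timing around each stage is enough (a sketch, with `retriever`, `chain`, and `question` as defined earlier):

```python
import time

t0 = time.perf_counter()
docs = retriever.get_relevant_documents(question)  # query embedding + FAISS search
t1 = time.perf_counter()
# The chain re-runs retrieval internally, so this slightly overstates pure inference.
result = chain({"query": question})
t2 = time.perf_counter()

print(f"retrieval: {(t1 - t0) * 1000:.0f} ms, generation: {t2 - t1:.1f} s")
```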
What I'd Do Differently
1. Add a reranker.
After initial retrieval, a cross-encoder reranker re-scores chunks by relevance to the specific question. This would further improve precision without reducing recall. I'd use a lightweight cross-encoder from sentence-transformers for this.
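A rough sketch of what that could look like with sentence-transformers (the model choice and the top-3 cutoff are illustrative, not what the repo ships):

```python
from sentence_transformers import CrossEncoder

# Cross-encoders score (query, passage) pairs jointly, which is slower
# but more accurate than the bi-encoder used for initial retrieval.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(question, doc.page_content) for doc in retrieved_docs]
scores = reranker.predict(pairs)

# Keep the highest-scoring chunks instead of trusting bi-encoder order.
reranked = [doc for _, doc in sorted(
    zip(scores, retrieved_docs), key=lambda p: p[0], reverse=True
)][:3]
```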
2. Implement conversation memory properly.
Currently, session-level chat history is stored in Streamlit session state. For persistence across sessions, a proper vector-based memory system would be better.
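For reference, the current approach amounts to something like this (a sketch; the repo's variable names may differ, and the history vanishes when the browser tab closes):

```python
import streamlit as st

# Streamlit re-runs the script on every interaction; session_state is
# the only thing that survives between re-runs within one session.
if "history" not in st.session_state:
    st.session_state.history = []

st.session_state.history.append({"question": question, "answer": answer})
```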
3. Add an evaluation framework.
I have no automated way to measure retrieval quality across questions. A proper RAG evaluation framework using RAGAS would let me measure faithfulness, answer relevance, and context precision systematically.
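A minimal sketch of what that could look like, assuming the ragas and datasets packages and a small hand-labelled question set (all values below are placeholders):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

eval_data = Dataset.from_dict({
    "question": ["What is the notice period?"],
    "answer": ["30 days, per section 4.2."],
    "contexts": [["Either party may terminate with 30 days written notice (4.2)."]],
    "ground_truth": ["30 days"],
})

report = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_precision])
print(report)
```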
Results
- Answers 100-page documents in under 5 seconds
- Source-page citations on every answer
- Multi-PDF upload supported
- Session-level chat history maintained
- Retrieval precision prioritised over recall
Full Code
github.com/karankavyanjali77-sys/ai-pdf-rag-chatbot
About the Author
Kavyanjali Karan is a pre-final year CS undergrad at SOA University, Bhubaneswar (class of 2027), has built 8 deployed ML and data engineering systems, and was selected for the McKinsey Forward Program and Google Gen AI Academy APAC 2026. Currently seeking Data Science and Business Analytics intern roles.
GitHub: github.com/karankavyanjali77-sys