Introduction
Most RAG tutorials show you how to build a chatbot that answers questions from a PDF.
Very few talk about why certain architectural decisions matter — and what happens to performance, accuracy, and user trust when you get them wrong.
I built a RAG pipeline that answers questions across 100-page documents in under 5 seconds with source-page citations. Here's every decision I made, why I made it, and what I'd do differently.
Full code: github.com/karankavyanjali77-sys/ai-pdf-rag-chatbot
The Problem With Most RAG Implementations
The default approach most tutorials show:
Load PDF
Split into chunks
Embed everything
Retrieve top-k chunks
Send to LLM
This works. But it has three problems that matter in production:
Problem 1 — Chunk size is arbitrary.
Most tutorials use 1000 characters per chunk with 200 character overlap. Nobody explains why. The chunk size directly determines whether your retrieved context is useful or noisy.
Problem 2 — Top-k retrieval optimises for recall, not precision.
Retrieving the top 5 chunks sounds safe. But if 3 of those chunks are tangentially related, you're sending noise to the LLM. Noise produces hallucinations.
Problem 3 — No source attribution.
Users don't trust answers they can't verify. Without source-page citations, a RAG system is just a black box.
My architecture addresses all three.
The Stack
LangChain — orchestration layer
FAISS — vector store (local, fast, no API cost)
Sentence-Transformers — embedding model (all-MiniLM-L6-v2)
Groq LLM — inference (llama3-8b-8192)
PyPDF — PDF parsing
Streamlit — UI layer
Why Groq instead of OpenAI? Speed. Groq's inference on llama3 is significantly faster than GPT-3.5 for this use case and costs less at scale. For a system where sub-5-second response time is the goal — inference speed matters.
Why FAISS instead of Pinecone or ChromaDB? For a single-user document Q&A system, a local vector store has zero latency overhead. No API calls. No network round trips. Pure in-memory search.
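For concreteness, here is a minimal sketch of how these pieces wire together. Package paths and the ChatGroq parameters assume recent langchain-community and langchain-groq releases, and the example document is a stand-in for the real chunks produced below.

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_groq import ChatGroq

# Embeddings run locally, so indexing and search cost no API calls.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# In the real pipeline these documents come from the chunking step below.
docs = [Document(page_content="Example chunk.", metadata={"page_number": 1})]
vectorstore = FAISS.from_documents(docs, embeddings)

# Groq-hosted llama3; requires GROQ_API_KEY in the environment.
llm = ChatGroq(model="llama3-8b-8192", temperature=0)
```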
The Architecture Decision That Matters Most — Chunking Strategy
This is where most RAG implementations go wrong.
I tested three chunking strategies:
Strategy 1 — Fixed size (1000 chars, 200 overlap)
Simple. Fast. But splits mid-sentence frequently. Retrieval quality suffers because semantic meaning gets cut.
Strategy 2 — Recursive character splitting
LangChain's RecursiveCharacterTextSplitter tries to split on paragraphs first, then sentences, then words. Better semantic coherence than fixed size.
Strategy 3 — Smaller chunks with higher overlap (500 chars, 100 overlap)
Counter-intuitive but this is what I landed on. Here's why:
Smaller chunks = more precise retrieval. When a user asks a specific question, a 500-character chunk containing exactly the relevant sentence retrieves better than a 1000-character chunk that contains the relevant sentence plus 500 characters of surrounding noise.
Higher overlap = context preservation. The 100-character overlap ensures that sentences split across chunk boundaries still appear in at least one complete chunk.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # small chunks for precise retrieval
    chunk_overlap=100,   # overlap preserves sentences cut at boundaries
    length_function=len,
    separators=["\n\n", "\n", ".", "!", "?", ",", " "]
)
```
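Applying the splitter is a single call (`pages` here is the list of page documents produced by the PDF loader, shown in the citations section below):

```python
# Each chunk inherits the metadata of the page it came from.
chunks = text_splitter.split_documents(pages)
```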
Result: Retrieval precision improved measurably. Answers became more specific and less likely to include irrelevant surrounding context.
Precision Over Recall — The Retrieval Decision
Most RAG implementations retrieve the top 5 or top 10 chunks, on the theory that more context means a safer answer.
I chose top-3.
Here's why: LLMs hallucinate more when given noisy context than when given limited but precise context. Sending 5 chunks where 2 are irrelevant is worse than sending 3 chunks where all 3 are relevant.
The tradeoff: occasionally the answer requires information from a 4th chunk that gets missed. I mitigated this by using a similarity threshold — chunks below 0.7 cosine similarity are excluded regardless of k.
```python
retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        "k": 3,                  # precision over recall
        "score_threshold": 0.7   # drop weak matches regardless of k
    }
)
```
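Querying it is then a single call; chunks below the threshold never reach the LLM (the question string is illustrative):

```python
relevant_chunks = retriever.get_relevant_documents("What is the notice period?")
```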
Source-Page Citations — Building User Trust
Every answer surfaces the source page number from the original PDF.
This required storing page metadata during the chunking phase:
```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(pdf_path)
pages = loader.load_and_split()
for i, page in enumerate(pages):
    page.metadata["page_number"] = i + 1  # 1-indexed, matching the PDF
```
Then surfacing it in the response:
```python
# Deduplicate the page numbers of the chunks that were actually retrieved.
source_pages = list(set([
    doc.metadata.get("page_number", "Unknown")
    for doc in retrieved_docs
]))
response = f"{answer}\n\nSources: Pages {source_pages}"
```
Why this matters: a user asking a question about a legal document or technical specification needs to verify the answer. "The answer is X (Source: Page Y)" is trusted. "The answer is X" is not.
The Two-Module Architecture
I separated the system into two independent modules:
app.py — Streamlit UI layer
rag_engine.py — Retrieval and generation logic
Why does this matter? The RAG engine can be called from any interface — a REST API, a Slack bot, a CLI tool — without touching the UI code. This is production-style architecture. Not a notebook demo.
rag_engine.py
```python
class RAGEngine:
    def __init__(self, pdf_path):
        # Build the FAISS index and the QA chain once, at startup.
        self.vectorstore = self._build_vectorstore(pdf_path)
        self.chain = self._build_chain()

    def answer(self, question):
        # The chain returns the generated answer plus the chunks it used.
        result = self.chain({"query": question})
        return result["result"], result["source_documents"]
```
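As a sketch of what that separation buys, the same engine drops straight into a CLI with no Streamlit anywhere (the cli.py file is hypothetical, not part of the repo):

```python
# cli.py -- usage: python cli.py document.pdf "What is the notice period?"
import sys

from rag_engine import RAGEngine

engine = RAGEngine(pdf_path=sys.argv[1])
answer, sources = engine.answer(sys.argv[2])
print(answer)
print("Sources:", [doc.metadata.get("page_number") for doc in sources])
```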
Performance — How We Got Under 5 Seconds
End-to-end response time breakdown:
| Step | Time |
| --- | --- |
| Query embedding | ~50ms |
| FAISS similarity search | ~10ms |
| Groq LLM inference | ~3–4 seconds |
| Total | ~4–4.5 seconds |
The bottleneck is LLM inference — not retrieval. FAISS is extremely fast for in-memory search. Groq's inference speed on llama3 is what makes sub-5-second response time achievable.
For comparison: the same pipeline with GPT-3.5-turbo averaged 8–12 seconds end-to-end. Groq is the reason we hit the performance target.
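If you want to reproduce this breakdown on your own documents, crude wall-clock timing around each stage is enough (a sketch, with `retriever`, `chain`, and `question` as defined earlier):

```python
import time

t0 = time.perf_counter()
docs = retriever.get_relevant_documents(question)  # query embedding + FAISS search
t1 = time.perf_counter()
# The chain re-runs retrieval internally, so this slightly overstates pure inference.
result = chain({"query": question})
t2 = time.perf_counter()

print(f"retrieval: {(t1 - t0) * 1000:.0f} ms, generation: {t2 - t1:.1f} s")
```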
What I'd Do Differently
1. Add a reranker.
After initial retrieval, a cross-encoder reranker re-scores chunks by relevance to the specific question. This would further improve precision without reducing recall. I'd use a lightweight cross-encoder from sentence-transformers for this.
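A rough sketch of what that could look like with sentence-transformers (the model choice and the top-3 cutoff are illustrative, not what the repo ships):

```python
from sentence_transformers import CrossEncoder

# Cross-encoders score (query, passage) pairs jointly, which is slower
# but more accurate than the bi-encoder used for initial retrieval.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(question, doc.page_content) for doc in retrieved_docs]
scores = reranker.predict(pairs)

# Keep the highest-scoring chunks instead of trusting bi-encoder order.
reranked = [doc for _, doc in sorted(
    zip(scores, retrieved_docs), key=lambda p: p[0], reverse=True
)][:3]
```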
2. Implement conversation memory properly.
Currently, session-level chat history is stored in Streamlit session state. For persistence across sessions, a proper vector-based memory system would be better.
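For reference, the current approach amounts to something like this (a sketch; the repo's variable names may differ, and the history vanishes when the browser tab closes):

```python
import streamlit as st

# Streamlit re-runs the script on every interaction; session_state is
# the only thing that survives between re-runs within one session.
if "history" not in st.session_state:
    st.session_state.history = []

st.session_state.history.append({"question": question, "answer": answer})
```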
3. Add an evaluation framework.
I have no automated way to measure retrieval quality across questions. A proper RAG evaluation framework using RAGAS would let me measure faithfulness, answer relevance, and context precision systematically.
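A minimal sketch of what that could look like, assuming the ragas and datasets packages and a small hand-labelled question set (all values below are placeholders):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

eval_data = Dataset.from_dict({
    "question": ["What is the notice period?"],
    "answer": ["30 days, per section 4.2."],
    "contexts": [["Either party may terminate with 30 days written notice (4.2)."]],
    "ground_truth": ["30 days"],
})

report = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_precision])
print(report)
```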
Results
- Answers 100-page documents in under 5 seconds
- Source-page citations on every answer
- Multi-PDF upload supported
- Session-level chat history maintained
- Retrieval precision prioritised over recall
Full Code
github.com/karankavyanjali77-sys/ai-pdf-rag-chatbot
About the Author
Kavyanjali Karan is a pre-final year CS undergrad at SOA University, Bhubaneswar (class of 2027), has built 8 deployed ML and data engineering systems, and was selected for the McKinsey Forward Program and Google Gen AI Academy APAC 2026. Currently seeking Data Science and Business Analytics intern roles.
GitHub: github.com/karankavyanjali77-sys