I benchmarked RAG vs GraphRAG vs pre-structured knowledge graphs across 45 domains — here's what happened

Originally published at dev.to · 2 min read

Three retrieval architectures. Same LLM. Same 7,928 queries across 45 domains. Different structure going in.

Here are the results:

| System | F1 score | Tokens/query | Cost/query |
|---|---|---|---|
| RAG (FAISS + Claude) | 0.123 | 2,982 | ~$0.009 |
| GraphRAG (Microsoft) | 0.120 | 3,450 | ~$0.013 |
| CKG (pre-structured DAG) | 0.471 | 269 | ~$0.001 |

CKG is 4x more accurate and uses 11x fewer tokens than RAG.

What is a CKG?

A Compact Knowledge Graph (CKG) pre-structures domain knowledge as a directed acyclic graph (DAG): concepts are nodes, dependencies are edges. Each domain is a plain CSV file:

```
ConceptID,ConceptLabel,Dependencies,TaxonomyID
1,Calculus,2|3,CORE
2,Algebra,,FOUND
3,Trigonometry,,FOUND
```

When an agent asks "what do I need to know before Calculus?", CKG traverses edges. No embedding. No similarity search. No hallucination by construction.
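To make the traversal concrete, here is a minimal sketch of that prerequisite lookup over the CSV schema above. The function names are mine for illustration, not the `ckg-mcp` package's actual API:

```python
import csv
import io

# The toy graph from the post: Calculus depends on Algebra and Trigonometry.
CKG_CSV = """ConceptID,ConceptLabel,Dependencies,TaxonomyID
1,Calculus,2|3,CORE
2,Algebra,,FOUND
3,Trigonometry,,FOUND
"""

def load_ckg(text):
    """Map each concept ID to (label, list of direct dependency IDs)."""
    nodes = {}
    for row in csv.DictReader(io.StringIO(text)):
        deps = [d for d in row["Dependencies"].split("|") if d]
        nodes[row["ConceptID"]] = (row["ConceptLabel"], deps)
    return nodes

def prerequisites(nodes, concept_id, seen=None):
    """Depth-first walk over dependency edges — no embeddings, no similarity search."""
    seen = set() if seen is None else seen
    labels = []
    for dep in nodes[concept_id][1]:
        if dep not in seen:
            seen.add(dep)
            labels.append(nodes[dep][0])
            labels.extend(prerequisites(nodes, dep, seen))
    return labels

nodes = load_ckg(CKG_CSV)
print(prerequisites(nodes, "1"))  # ['Algebra', 'Trigonometry']
```

The answer is read directly off the edges, which is why there is nothing to hallucinate: either a dependency row exists or it doesn't.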

Why RAG fails on multi-hop queries

RAG retrieves the most similar text chunk to a query. For simple lookups, this works. For multi-hop questions — prerequisites, dependency chains, drug interactions, regulatory trees — it fragments the answer across chunks that contradict each other.

F1 by hop depth:

| Hop depth | CKG | RAG |
|---|---|---|
| 1 | 0.374 | 0.312 |
| 2 | 0.512 | 0.298 |
| 3 | 0.631 | 0.241 |
| 4 | 0.714 | 0.198 |
| 5 | 0.772 | 0.187 |

CKG improves monotonically with depth. RAG peaks at hop 1 and degrades steadily from there. The deeper the question, the larger the gap.

Where CKG dominates by query type

| Query type | CKG | RAG | Advantage |
|---|---|---|---|
| Aggregate (T4) | 0.964 | 0.286 | 3.4x |
| Path traversal (T3) | 0.660 | 0.201 | 3.3x |
| Dependency (T2) | 0.634 | 0.078 | 8.1x |
| Cross-concept (T5) | 0.323 | 0.115 | 2.8x |
| Entity lookup (T1) | 0.207 | 0.094 | 2.2x |

The biggest win (8.1x) is on dependency queries — the exact query type that matters in clinical, legal, financial, and regulatory domains.

Structure is the signal — not curation effort

Track 2: I built a GLP-1/pharma domain from the ClinicalTrials.gov API in a single session. No expert curation.

F1 = 0.530 — higher than the 45-domain average.

If a domain has knowable dependencies, it can be CKG-ified. The structure drives accuracy, not the effort.
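To show what "CKG-ified" means mechanically, here is a toy sketch that emits the CSV schema from earlier, given records with known dependencies. The domain records below are invented placeholders, not data from the ClinicalTrials.gov track:

```python
import csv
import io

# Invented example records: concept label -> labels it depends on.
domain = {
    "Phase 3 trial": ["Phase 2 trial"],
    "Phase 2 trial": ["Phase 1 trial"],
    "Phase 1 trial": [],
}

def to_ckg_csv(domain, taxonomy="CORE"):
    """Assign numeric IDs and write one row per concept in the CKG CSV schema."""
    ids = {label: str(i) for i, label in enumerate(domain, start=1)}
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["ConceptID", "ConceptLabel", "Dependencies", "TaxonomyID"])
    for label, deps in domain.items():
        writer.writerow([ids[label], label, "|".join(ids[d] for d in deps), taxonomy])
    return buf.getvalue()

print(to_ckg_csv(domain))
```

If your source data already tells you what depends on what, the graph construction itself is this mechanical; the work is in knowing the dependencies, not in encoding them.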

Try it

MCP server — works in Claude Code and any MCP-compatible agent:

```
pip install ckg-mcp
```

Your agent gets four tools: `list_domains`, `query_ckg`, `get_prerequisites`, and `search_concepts`.

Live demo: https://huggingface.co/spaces/danyarm/ckg-demo

Full dataset (45 domain CSVs + 7,928 query JSONL + results):
https://huggingface.co/datasets/danyarm/ckg-benchmark

Paper + benchmark code:
https://github.com/Yarmoluk/ckg-benchmark

One-page summary:
https://github.com/Yarmoluk/ckg-benchmark/blob/main/SUMMARY.md

Custom domains

The benchmark covers 45 general domains. For clinical, legal, financial, or regulatory domains where dependency structure is critical: graphifymd.com


All code MIT licensed. Data CC BY 4.0. Questions welcome in the comments.
