I benchmarked RAG vs GraphRAG vs pre-structured knowledge graphs across 45 domains — here's what happened

Originally published at dev.to · 2 min read

Three retrieval architectures. Same LLM. Same 7,928 queries across 45 domains. Different structure going in.

Here are the results:

| System | F1 score | Tokens/query | Cost/query |
|---|---|---|---|
| RAG (FAISS + Claude) | 0.123 | 2,982 | ~$0.009 |
| GraphRAG (Microsoft) | 0.120 | 3,450 | ~$0.013 |
| CKG (pre-structured DAG) | 0.471 | 269 | ~$0.001 |

CKG is 4x more accurate and uses 11x fewer tokens than RAG.

What is a CKG?

A Compact Knowledge Graph (CKG) pre-structures domain knowledge as a directed acyclic graph (DAG): concepts are nodes, dependencies are edges. Each domain is a plain CSV file:

```
ConceptID,ConceptLabel,Dependencies,TaxonomyID
1,Calculus,2|3,CORE
2,Algebra,,FOUND
3,Trigonometry,,FOUND
```

When an agent asks "what do I need to know before Calculus?", CKG traverses edges. No embedding. No similarity search. No hallucination by construction.
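To make the traversal concrete, here is a minimal sketch of that prerequisite lookup over the CSV schema above. The function names are mine for illustration, not the `ckg-mcp` package's actual API:

```python
import csv
import io

# The toy graph from the post: Calculus depends on Algebra and Trigonometry.
CKG_CSV = """ConceptID,ConceptLabel,Dependencies,TaxonomyID
1,Calculus,2|3,CORE
2,Algebra,,FOUND
3,Trigonometry,,FOUND
"""

def load_ckg(text):
    """Map each concept ID to (label, list of direct dependency IDs)."""
    nodes = {}
    for row in csv.DictReader(io.StringIO(text)):
        deps = [d for d in row["Dependencies"].split("|") if d]
        nodes[row["ConceptID"]] = (row["ConceptLabel"], deps)
    return nodes

def prerequisites(nodes, concept_id, seen=None):
    """Depth-first walk over dependency edges — no embeddings, no similarity search."""
    seen = set() if seen is None else seen
    labels = []
    for dep in nodes[concept_id][1]:
        if dep not in seen:
            seen.add(dep)
            labels.append(nodes[dep][0])
            labels.extend(prerequisites(nodes, dep, seen))
    return labels

nodes = load_ckg(CKG_CSV)
print(prerequisites(nodes, "1"))  # ['Algebra', 'Trigonometry']
```

The answer is read directly off the edges, which is why there is nothing to hallucinate: either a dependency row exists or it doesn't.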

Why RAG fails on multi-hop queries

RAG retrieves the most similar text chunk to a query. For simple lookups, this works. For multi-hop questions — prerequisites, dependency chains, drug interactions, regulatory trees — it fragments the answer across chunks that contradict each other.

F1 by hop depth:

| Hop depth | CKG | RAG |
|---|---|---|
| 1 | 0.374 | 0.312 |
| 2 | 0.512 | 0.298 |
| 3 | 0.631 | 0.241 |
| 4 | 0.714 | 0.198 |
| 5 | 0.772 | 0.187 |

CKG improves monotonically with depth. RAG peaks at hop 1 and degrades steadily from there. The deeper the question, the larger the gap.

Where CKG dominates by query type

| Query type | CKG | RAG | Advantage |
|---|---|---|---|
| Aggregate (T4) | 0.964 | 0.286 | 3.4x |
| Path traversal (T3) | 0.660 | 0.201 | 3.3x |
| Dependency (T2) | 0.634 | 0.078 | 8.1x |
| Cross-concept (T5) | 0.323 | 0.115 | 2.8x |
| Entity lookup (T1) | 0.207 | 0.094 | 2.2x |

The biggest win (8.1x) is on dependency queries — the exact query type that matters in clinical, legal, financial, and regulatory domains.

Structure is the signal — not curation effort

Track 2: I built a GLP-1/pharma domain from the ClinicalTrials.gov API in a single session. No expert curation.

F1 = 0.530 — higher than the 45-domain average.

If a domain has knowable dependencies, it can be CKG-ified. The structure drives accuracy, not the effort.
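To show what "CKG-ified" means mechanically, here is a toy sketch that emits the CSV schema from earlier, given records with known dependencies. The domain records below are invented placeholders, not data from the ClinicalTrials.gov track:

```python
import csv
import io

# Invented example records: concept label -> labels it depends on.
domain = {
    "Phase 3 trial": ["Phase 2 trial"],
    "Phase 2 trial": ["Phase 1 trial"],
    "Phase 1 trial": [],
}

def to_ckg_csv(domain, taxonomy="CORE"):
    """Assign numeric IDs and write one row per concept in the CKG CSV schema."""
    ids = {label: str(i) for i, label in enumerate(domain, start=1)}
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["ConceptID", "ConceptLabel", "Dependencies", "TaxonomyID"])
    for label, deps in domain.items():
        writer.writerow([ids[label], label, "|".join(ids[d] for d in deps), taxonomy])
    return buf.getvalue()

print(to_ckg_csv(domain))
```

If your source data already tells you what depends on what, the graph construction itself is this mechanical; the work is in knowing the dependencies, not in encoding them.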

Try it

MCP server — works in Claude Code and any MCP-compatible agent:

```
pip install ckg-mcp
```

Your agent gets four tools: `list_domains`, `query_ckg`, `get_prerequisites`, and `search_concepts`.

Live demo: https://huggingface.co/spaces/danyarm/ckg-demo

Full dataset (45 domain CSVs + 7,928 query JSONL + results):
https://huggingface.co/datasets/danyarm/ckg-benchmark

Paper + benchmark code:
https://github.com/Yarmoluk/ckg-benchmark

One-page summary:
https://github.com/Yarmoluk/ckg-benchmark/blob/main/SUMMARY.md

Custom domains

The benchmark covers 45 general domains. For clinical, legal, financial, or regulatory domains where dependency structure is critical: graphifymd.com


All code MIT licensed. Data CC BY 4.0. Questions welcome in the comments.
