How I Built Rumi: An Autonomous Scientific Discovery Engine That Thinks Before It Asks
Okay so like hear me out. I know "autonomous scientific discovery" sounds like some buzzword soup that a LinkedIn influencer would drop to farm impressions. But I actually built the thing. It reads real papers, finds real contradictions in the literature, and generates real testable hypotheses. No human in the loop. On god.
I'm actively still working on her every single day. She isn't done. She isn't even close to done. Oh yeah and she's getting smarter every cycle.
Let me walk you through how Rumi actually works under the hood, because honestly the architecture is the interesting part -- not the pitch.
THE CORE PROBLEM
Scientific hypothesis generation is painfully slow. Like genuinely painful. A researcher spends months sifting through thousands of papers, manually extracting findings, cross-referencing contradictions, and trying to piece together a testable hypothesis from all that noise. A lot of drug resistance mechanisms remain unexplained, and part of the reason is that nobody has the time to mine contradictions at scale across live literature.
So I asked myself: what if a system could do all of that autonomously? Not just search papers -- actually understand them, build a knowledge graph, find where the literature disagrees with itself, and then formulate hypotheses from those disagreements.
That is Rumi.
THE ARCHITECTURE (This Is Where It Gets Interesting)
Rumi isn't a wrapper around ChatGPT. I want to be clear about that. The brain alone has like 40+ modules. The discovery pipeline has its own engine, the scientist subsystem has its own pipeline, and there's a whole security layer on top. This is a proper system, not a script.
Let me break it down layer by layer.
LAYER 1: THE BRAIN
The brain directory is where the cognitive architecture lives. And I'm not gonna lie, this is the part I'm most proud of.
Active Inference Engine -- This is based on Karl Friston's Free Energy Principle. The idea is simple: organisms minimize surprise. Rumi does the same thing. Before it calls any tool, it predicts the outcome -- will it succeed, how long will it take, how uncertain is the prediction. After the call, it computes the prediction error and updates its world model.
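import math

def compute_prediction_error(self, tool_name, prediction, actual_success, actual_duration_ms):
    # Assumed shape: prediction is a dict with "success" (probability) and "duration_ms"
    pred_success = prediction.get("success", 0.5)
    pred_duration = prediction.get("duration_ms", 0)
    actual_val = 1.0 if actual_success else 0.0
    success_error = abs(pred_success - actual_val)
    duration_error = 0.0  # stays zero when either duration is unknown
    if actual_duration_ms > 0 and pred_duration > 0:
        ratio = actual_duration_ms / max(pred_duration, 1)
        duration_error = math.log2(ratio) * 0.3 if ratio > 1 else abs(1 - ratio) * 0.3
    return min(success_error + duration_error, 2.0)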
The learning rate decreases with more observations -- just like a real Bayesian agent. When prediction errors spike, Rumi flags that tool for epistemic foraging, which basically means "I don't understand this well enough yet, let me explore more." That's not a prompt trick. That's actual computational structure.
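Here's a minimal sketch of that shrinking learning rate and the foraging flag. It is not Rumi's actual code: `ToolModel`, `needs_foraging`, `spike_factor`, and `floor` are names and thresholds I made up for illustration, and the 1/n learning rate is just the simplest schedule that decreases with observations.

```python
from dataclasses import dataclass


@dataclass
class ToolModel:
    """Running estimate of how surprising one tool's outcomes are (hypothetical)."""
    mean_error: float = 0.0
    observations: int = 0

    def update(self, error: float) -> float:
        # Learning rate 1/n: the first observations move the estimate a lot,
        # later ones barely nudge it. This is an incremental running mean.
        self.observations += 1
        learning_rate = 1.0 / self.observations
        self.mean_error += learning_rate * (error - self.mean_error)
        return learning_rate


def needs_foraging(model: ToolModel, latest_error: float,
                   spike_factor: float = 2.0, floor: float = 0.2) -> bool:
    # Flag the tool when the newest error sits well above its running mean.
    # spike_factor and floor are assumed values, not tuned ones.
    return latest_error > max(model.mean_error * spike_factor, floor)


model = ToolModel()
for observed_error in (0.1, 0.1, 0.1):
    model.update(observed_error)
flagged = needs_foraging(model, latest_error=0.9)  # a sudden spike gets flagged
```

The check runs against the running mean from before the spike, so one bad call is enough to trigger exploration without wrecking the long-term estimate.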
Hierarchical Active Inference -- This extends the flat active inference engine into three levels: Meta (strategic), Subgoal (tactical), and Action (execution). Each level maintains its own belief state with Bayesian updates. Top-down constraints propagate from meta to action, and bottom-up prediction errors propagate from action to meta. It's literally modeled after how the prefrontal cortex and motor cortex interact. Fr fr.
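To make "belief state with Bayesian updates" concrete, here's a rough sketch of the update a single level could run. The function name, the dict-based belief format, and the `precision` exponent are all assumptions on my part, not what the repo does.

```python
def bayes_update(prior: dict, likelihood: dict, precision: float = 1.0) -> dict:
    """Posterior over hypotheses: prior * likelihood**precision, renormalized.

    precision is an assumed tempering knob: 0 ignores the evidence,
    1 is a plain Bayes update, above 1 trusts the evidence more.
    """
    unnormalized = {h: p * likelihood.get(h, 0.0) ** precision for h, p in prior.items()}
    total = sum(unnormalized.values())
    if total == 0:
        return dict(prior)  # evidence ruled out every hypothesis; keep the prior
    return {h: p / total for h, p in unnormalized.items()}


# Bottom-up step: evidence from the action level updates the meta-level belief.
# The hypothesis names and likelihood numbers are invented for the example.
meta_belief = {"explore": 0.5, "exploit": 0.5}
meta_belief = bayes_update(meta_belief, {"explore": 0.9, "exploit": 0.3})
```

A top-down pass would work the other way around: the meta level's posterior is handed down as the prior that the subgoal level starts from.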
Each belief state maintains a probability distribution over hypotheses, precision weighting that modulates learning rate, and a variational free energy computation. The Expected Free Energy (EFE) for action selection weighs five factors:
EFE_WEIGHTS = {
    "expected_cost": 0.25,
    "expected_risk": 0.25,
    "information_gain": 0.20,
    "goal_relevance": 0.15,
    "complexity_penalty": 0.15,
}
This isn't me just calling an LLM and hoping for the best. This is actual decision theory baked into the architecture.
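Here's one way those weights could drive action selection. Treat it as a sketch under assumptions: I'm guessing the sign convention (costs, risk, and complexity add to EFE; information gain and goal relevance subtract), that every factor is scored in [0, 1], and that the lowest EFE wins. The candidate actions and their scores are invented.

```python
EFE_WEIGHTS = {
    "expected_cost": 0.25,
    "expected_risk": 0.25,
    "information_gain": 0.20,
    "goal_relevance": 0.15,
    "complexity_penalty": 0.15,
}

PENALTIES = ("expected_cost", "expected_risk", "complexity_penalty")
BENEFITS = ("information_gain", "goal_relevance")


def expected_free_energy(scores: dict) -> float:
    """Lower is better: weighted penalties minus weighted benefits (assumed convention)."""
    penalty = sum(EFE_WEIGHTS[k] * scores.get(k, 0.0) for k in PENALTIES)
    benefit = sum(EFE_WEIGHTS[k] * scores.get(k, 0.0) for k in BENEFITS)
    return penalty - benefit


def select_action(candidates: dict) -> str:
    # Pick the candidate whose factor scores give the lowest EFE.
    return min(candidates, key=lambda name: expected_free_energy(candidates[name]))


candidates = {
    "query_pubmed": {"expected_cost": 0.2, "expected_risk": 0.1, "information_gain": 0.9,
                     "goal_relevance": 0.8, "complexity_penalty": 0.1},
    "reuse_cache": {"expected_cost": 0.05, "expected_risk": 0.05, "information_gain": 0.1,
                    "goal_relevance": 0.3, "complexity_penalty": 0.05},
}
best = select_action(candidates)
```

With these numbers the expensive but informative query wins, which is the whole point of giving information gain its own weight.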
Dreaming System -- When Rumi is idle for 2 minutes, it dreams. No cap. It replays memories, extracts patterns, validates those patterns against reality, and feeds insights into the curiosity module. It's inspired by hippocampal replay during sleep -- the same process that helps humans consolidate memories.
The dreaming system does five things:
- Pattern decay -- unconfirmed patterns lose strength over time (7-day half-life). If a pattern isn't reinforced, it fades. Just like real memory.
- Dream diversity -- it rotates through categories instead of replaying the same stuff. No echo chambers.
- Dream-reality validation -- it tracks whether predicted patterns actually hold true in the real world. Correct predictions get strengthened, wrong ones get weakened.
- Curiosity-informed dreaming -- if the curiosity module flagged something for exploration, the dreaming system prioritizes replaying related memories.
- Cross-module consolidation -- patterns discovered during dreaming get fed into the learning engine as new insights.
def _run_dream_cycle(self, force=False):
    self._decay_patterns()     # Forget weak patterns
    patterns = self._replay()  # Replay memories, find patterns
    # ... store patterns, feed curiosity, consolidate to learning
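The 7-day half-life from the pattern decay bullet is easy to show on its own. This is a sketch, not the real `_decay_patterns`: the pattern dict fields (`base_strength`, `last_confirmed_day`) and the `min_strength` cutoff are assumptions.

```python
HALF_LIFE_DAYS = 7.0


def decayed_strength(base_strength: float, days_since_confirmed: float) -> float:
    # Exponential decay: strength halves every HALF_LIFE_DAYS without reinforcement.
    return base_strength * 0.5 ** (days_since_confirmed / HALF_LIFE_DAYS)


def decay_patterns(patterns: list, today: float, min_strength: float = 0.05) -> list:
    """Recompute each pattern's strength and drop the ones that have faded out."""
    kept = []
    for pattern in patterns:
        age = today - pattern["last_confirmed_day"]
        strength = decayed_strength(pattern["base_strength"], age)
        if strength >= min_strength:
            kept.append({**pattern, "strength": strength})
    return kept


patterns = [
    {"name": "fresh", "base_strength": 0.8, "last_confirmed_day": 10.0},
    {"name": "stale", "base_strength": 0.1, "last_confirmed_day": 0.0},
]
surviving = decay_patterns(patterns, today=14.0)
```

Computing from the strength at the last confirmation, instead of decaying the already-decayed value, means running the cycle more often doesn't make patterns fade faster.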
There's also a metacognitive monitor, a neurosymbolic reasoner (propositional logic engine built from scratch -- no SymPy, no Z3), a theory of mind module, a causal reasoner implementing Pearl's three levels of causation, an intuition engine based on Kahneman's dual-process theory, and like 30 more modules. Each one does something specific. None of them are decorative.
LAYER 2: THE DISCOVERY PIPELINE
This is where Rumi actually does science.
Literature Ingestion -- Rumi doesn't just hit one API. It queries PubMed, arXiv, Semantic Scholar, OpenAlex, CrossRef (via CIR API), plus domain-specific databases like UniProt for proteins, PubChem for compounds, PDB for structures, GBIF for biodiversity, NASA APIs for earth science, NOAA for climate data, USGS for geological data, World Bank for economic indicators, WHO for health data, and the Materials Project for materials science. It also checks GitHub, plus OEIS for mathematical sequences. The pipeline is domain-aware -- it routes to the right APIs based on what field you're researching.
Entity Extraction -- This is algorithmic. No LLM calls. It uses regex-based NLP patterns to extract entities and relationships from paper titles and abstracts. It recognizes phenomena, theories, measurements, parameters, methods, organizations, and general concepts. Relationship extraction uses co-occurrence analysis plus keyword pattern matching for types like "explains," "constrains," "causes," "measured_by," and "associated_with."
Why algorithmic and not LLM? Because it's fast, reliable, and never hangs. You don't want your discovery pipeline waiting 30 seconds for an LLM to tell you that "KRAS" is a gene.
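For a feel of what "co-occurrence plus keyword matching" looks like, here's a toy version. The entity regex (all-caps tokens) and the keyword lists are my own simplifications, far cruder than anything you'd ship; only the relation type names come from the description above.

```python
import itertools
import re

# Crude stand-in for an entity recognizer: all-caps tokens such as KRAS or EGFR.
ENTITY_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]{2,}\b")

# Trigger phrases per relation type (illustrative, not exhaustive).
RELATION_KEYWORDS = {
    "causes": ("causes", "induces", "drives"),
    "explains": ("explains", "accounts for"),
    "associated_with": ("associated with", "correlates with"),
}


def extract_entities(text: str) -> list:
    """Unique entity mentions in order of first appearance."""
    seen = []
    for match in ENTITY_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


def extract_relations(text: str) -> list:
    """Pair up entities that share a sentence containing a trigger phrase."""
    relations = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        entities = extract_entities(sentence)
        lowered = sentence.lower()
        for relation, keywords in RELATION_KEYWORDS.items():
            if any(keyword in lowered for keyword in keywords):
                for a, b in itertools.combinations(entities, 2):
                    relations.append((a, relation, b))
    return relations


abstract = "KRAS drives MEK activation. EGFR correlates with worse outcomes."
triples = extract_relations(abstract)
```

The second sentence produces nothing because it only mentions one entity, which is exactly the kind of miss you accept in exchange for speed.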
Knowledge Graph Construction -- Entities and relationships get assembled into a dynamic knowledge graph. The graph tracks entities with their types, aliases, and source papers. Relationships carry confidence scores, source papers, and temporal metadata.
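A stripped-down sketch of what such a graph could look like in memory. The class and field names are hypothetical, and the merge rules (keep the highest confidence, keep the earliest year) are assumptions I picked to keep the example small.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entity:
    name: str
    entity_type: str
    aliases: set = field(default_factory=set)
    source_papers: set = field(default_factory=set)


@dataclass
class Relationship:
    source: str
    relation: str
    target: str
    confidence: float
    source_papers: set = field(default_factory=set)
    first_seen_year: Optional[int] = None  # temporal metadata


class KnowledgeGraph:
    def __init__(self):
        self.entities = {}       # canonical name -> Entity
        self.relationships = {}  # (source, relation, target) -> Relationship
        self._alias_index = {}   # lowercased alias -> canonical name

    def add_entity(self, name, entity_type, paper_id, aliases=()):
        # Resolve known aliases so "K-Ras" and "KRAS" land on the same node.
        key = self._alias_index.get(name.lower(), name)
        entity = self.entities.setdefault(key, Entity(key, entity_type))
        entity.source_papers.add(paper_id)
        for alias in (name, *aliases):
            entity.aliases.add(alias)
            self._alias_index[alias.lower()] = key
        return entity

    def add_relationship(self, source, relation, target, paper_id, confidence, year=None):
        key = (source, relation, target)
        rel = self.relationships.setdefault(
            key, Relationship(source, relation, target, confidence, first_seen_year=year))
        rel.confidence = max(rel.confidence, confidence)  # assumed merge rule
        if year is not None and (rel.first_seen_year is None or year < rel.first_seen_year):
            rel.first_seen_year = year
        rel.source_papers.add(paper_id)
        return rel
```

Keeping the source papers on both nodes and edges is what makes the contradiction mining step possible, since every claim can be traced back to who made it.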
Contradiction Mining -- This is the core innovation and honestly the part that makes Rumi actually useful. The ContradictionMiner looks for four types of contradictions:
- Direct contradictions -- same entity pair, opposite relationships. Like Paper A says "Drug X activates Gene Y" and Paper B says "Drug X inhibits Gene Y."
- Path contradictions -- contradictions through intermediate entities. Entity A positively regulates Entity C through path 1, but negatively through path 2. These are the sneaky ones that humans miss.
- Paper contradictions -- different papers disagree on the same relationship. Goes beyond simple opposite detection -- groups by entity pair across all relation types and flags genuine disagreements with source attribution.
- Temporal contradictions -- entity roles that changed over time. If an entity was associated with positive effects in early papers but negative in later ones, that's a signal.
Each contradiction gets a severity score based on how many papers support each side.
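Here's a sketch of the simplest case, direct contradictions, with a paper-count-based severity. This is not the real ContradictionMiner: the opposite-relation table and the severity formula (how evenly split the papers are, scaled by how many there are) are assumptions for illustration.

```python
from collections import defaultdict

# Relation pairs treated as opposites (illustrative list).
OPPOSITES = (("activates", "inhibits"), ("upregulates", "downregulates"))


def severity(papers_a: int, papers_b: int) -> float:
    # Assumed formula: an even split across many papers scores close to 1,
    # a lone dissenting paper scores low.
    balance = min(papers_a, papers_b) / max(papers_a, papers_b)
    volume = 1 - 1 / (papers_a + papers_b)
    return balance * volume


def find_direct_contradictions(claims) -> list:
    """claims: iterable of (source, relation, target, paper_id) tuples."""
    by_pair = defaultdict(lambda: defaultdict(set))
    for source, relation, target, paper_id in claims:
        by_pair[(source, target)][relation].add(paper_id)

    found = []
    for pair, relations in by_pair.items():
        for rel_a, rel_b in OPPOSITES:
            side_a = relations.get(rel_a, set())
            side_b = relations.get(rel_b, set())
            if side_a and side_b:
                found.append({
                    "pair": pair,
                    "sides": {rel_a: sorted(side_a), rel_b: sorted(side_b)},
                    "severity": severity(len(side_a), len(side_b)),
                })
    return found


claims = [
    ("Drug X", "activates", "Gene Y", "paper_a"),
    ("Drug X", "inhibits", "Gene Y", "paper_b"),
    ("Drug X", "inhibits", "Gene Z", "paper_c"),
]
contradictions = find_direct_contradictions(claims)
```

Only the Drug X / Gene Y pair is flagged, and each side keeps its paper IDs so the source attribution survives into the hypothesis stage.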
Hypothesis Generation -- Contradictions become hypothesis seeds. The HypothesisEngine takes the knowledge graph, detected contradictions, and latent candidates, builds a structured prompt, and sends it through the LLM with retry logic and multi-provider fallback (Groq primary, Gemini backup). Every hypothesis gets:
- Algorithmic confidence scoring (not self-reported by the LLM)
- Novelty checking against existing hypothesis memory
- Automatic novelty capping (because LLMs always overclaim novelty -- I literally had to build a function to downrank "high" novelty claims to "medium")
- Deduplication against prior runs
- Persistence to disk with full provenance
def _cap_novelty(self, h):
    n = h.get("novelty", "medium")
    if n == "high":
        h["novelty"] = "medium"
        h["novelty_override"] = "downranked_from_high"
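Deduplication against prior runs can be sketched the same way. I'm assuming plain token-set Jaccard similarity with a 0.8 cutoff here; the function names and the threshold are made up, and the real check may well use something smarter.

```python
import re


def _tokens(text: str) -> set:
    # Lowercased alphanumeric tokens; punctuation and case differences vanish.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    tokens_a, tokens_b = _tokens(a), _tokens(b)
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_duplicate(candidate: str, memory: list, threshold: float = 0.8) -> bool:
    """True if the candidate is near-identical to any stored hypothesis."""
    return any(jaccard(candidate, prior) >= threshold for prior in memory)


memory = ["KRAS G12C resistance is driven by MEK reactivation"]
repeat = is_duplicate("KRAS G12C resistance is driven by MEK reactivation.", memory)
fresh = is_duplicate("EGFR amplification bypasses KRAS inhibition", memory)
```

Token overlap catches reworded repeats cheaply, but it won't catch two hypotheses that say the same thing in different vocabulary.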
LAYER 3: THE SCIENTIST SUBSYSTEM
This is a whole autonomous research team inside Rumi.
- Experiment Designer -- generates experimental protocols.
- Feynman Reducer -- simplifies complex concepts.
- Cross-Validator -- checks hypotheses against external data.
- Peer Reviewer -- simulates peer review.
- Paper Generator -- drafts research papers.
- Reproducibility Engine -- checks reproducibility.
- Knowledge Graph -- maintains a scientist-level knowledge graph.
- Lab Notebook -- tracks all experimental records.
There's also a hypothesis tournament system where multiple hypotheses compete based on evidence strength, falsifiability, and explanatory power. The best ones survive.
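One simple way to run a tournament like that is round-robin: every pair of hypotheses plays a match, and whoever is ahead on more criteria wins it. That format, the [0, 1] scores, and the tie-break by average score are my assumptions; only the three criteria come from the description above.

```python
from itertools import combinations

CRITERIA = ("evidence_strength", "falsifiability", "explanatory_power")


def average_score(hypothesis: dict) -> float:
    return sum(hypothesis[c] for c in CRITERIA) / len(CRITERIA)


def run_tournament(hypotheses: list, survivors: int = 2) -> list:
    """Round-robin: a match goes to whichever hypothesis leads on more criteria."""
    wins = {h["id"]: 0 for h in hypotheses}
    for a, b in combinations(hypotheses, 2):
        a_points = sum(a[c] > b[c] for c in CRITERIA)
        b_points = sum(b[c] > a[c] for c in CRITERIA)
        if a_points != b_points:  # a drawn match gives nobody a win
            wins[a["id"] if a_points > b_points else b["id"]] += 1
    # Rank by match wins, breaking ties with the plain average score.
    ranked = sorted(hypotheses, key=lambda h: (wins[h["id"]], average_score(h)), reverse=True)
    return ranked[:survivors]


field = [
    {"id": "h1", "evidence_strength": 0.9, "falsifiability": 0.8, "explanatory_power": 0.7},
    {"id": "h2", "evidence_strength": 0.5, "falsifiability": 0.9, "explanatory_power": 0.4},
    {"id": "h3", "evidence_strength": 0.2, "falsifiability": 0.3, "explanatory_power": 0.1},
]
winners = run_tournament(field)
```

Pairwise matches reward hypotheses that are solid across the board, where a single weighted sum would let one huge score paper over a weak one.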
LAYER 4: THE PIPELINE INFRASTRUCTURE
None of this would work without solid infrastructure.
Stage-based execution -- the pipeline runs as a sequence of stages with dependency tracking.
Checkpointing -- every stage saves its output to disk. If it crashes, resume from the last checkpoint.
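Stage-based execution and checkpointing fit in a few lines if you assume every stage returns JSON-serializable output. The `run_pipeline` name, the `(name, depends_on, fn)` stage format, and the one-file-per-stage layout are all hypothetical.

```python
import json
from pathlib import Path


def run_pipeline(stages: list, checkpoint_dir) -> dict:
    """stages: list of (name, depends_on, fn) tuples, already in execution order."""
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    outputs = {}
    for name, depends_on, fn in stages:
        missing = [dep for dep in depends_on if dep not in outputs]
        if missing:
            raise RuntimeError(f"stage {name!r} is missing inputs: {missing}")
        checkpoint = checkpoint_dir / f"{name}.json"
        if checkpoint.exists():
            # Resume: reuse the saved output instead of rerunning the stage.
            outputs[name] = json.loads(checkpoint.read_text())
            continue
        outputs[name] = fn(*(outputs[dep] for dep in depends_on))
        checkpoint.write_text(json.dumps(outputs[name]))
    return outputs
```

If a run dies halfway, calling `run_pipeline` again with the same directory skips every stage that already has a checkpoint file and picks up at the first one that doesn't.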
LLM calls with retry -- every LLM call goes through retry logic and multi-provider fallback (Groq primary, Gemini backup). The backoff schedule escalates: 2s, 5s, 15s, 30s.
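import asyncio

async def call_with_retry(self, prompt, json_mode=False, max_tokens=32768):
    # Assumed setup (not shown in the original excerpt): start on the primary provider
    provider_idx, failed_providers = 0, set()
    for attempt in range(self.max_retries):
        provider = self.providers[provider_idx]
        try:
            result = await self.call_llm(prompt, json_mode, max_tokens, provider)
            if result and len(result) > 20:  # reject empty or near-empty replies
                return result, provider
        except Exception as e:
            if any(k in str(e).lower() for k in ("401", "403", "unauthorized")):
                failed_providers.add(provider)
                # Assumed fallback step: auth errors won't fix themselves, so move on
                provider_idx = (provider_idx + 1) % len(self.providers)
        await asyncio.sleep(self.backoff[min(attempt, len(self.backoff) - 1)])
    return None, None  # every attempt failed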
WHAT IT ACTUALLY DOES
You point Rumi at a research topic -- say "KRAS G12C resistance mechanisms in non-small cell lung cancer" -- and it:
- Queries PubMed and related databases
- Extracts entities (genes, compounds, pathways, mutations)
- Builds a knowledge graph with semantic relationships
- Mines contradictions across the literature
- Generates testable hypotheses
- Scores them for confidence, novelty, and falsifiability
- Designs validation experiments
- Persists everything with full provenance
All autonomous. No human in the loop.
WHERE RUMI IS RIGHT NOW
Real talk -- Rumi isn't done. She's not even close to done. I'm actively working on her every single day and the codebase keeps growing. Like I just added new discovery APIs last week and refactored the contradiction miner like three days ago. She's a living system.
The entity extraction is algorithmic but I'm working on making it smarter. The contradiction miner handles the obvious cases but misses the nuanced ones -- context-dependent disagreements where two papers aren't technically contradicting each other but are painting different pictures. That's the next frontier.
The hypothesis quality still depends too much on the LLM underneath, and I'm building out more algorithmic checks to reduce that dependency. The scientist subsystem is functional but the peer reviewer and reproducibility engine need more work.
There's also a whole roadmap I haven't touched yet -- cross-domain hypothesis transfer (taking a mechanism discovered in oncology and checking if it applies in neurodegeneration), real-time literature monitoring that runs continuously instead of on-demand, and better experiment planning that accounts for actual lab constraints instead of generating idealized protocols.
Every week something new gets wired in, something old gets refined, and the architecture gets a little more solid. That's how real systems get built -- not in one shot, but iteratively, with each layer making the next one possible.
If you're reading this and thinking "this is cool but it's probably just a demo" -- it's not. It's a real system that I use, break, fix, and improve. And she's getting smarter every cycle.
THE HONEST TRUTH
Is Rumi perfect? Nah lol. The entity extraction is regex-based so it misses things. The contradiction miner doesn't handle nuanced disagreements well. The hypothesis quality depends heavily on the LLM, and LLMs hallucinate.
But here's the thing -- the architecture is sound. The cognitive scaffolding around the LLM is what makes it work. The active inference learns from its mistakes. The dreaming system consolidates patterns over time. The pipeline infrastructure makes it resilient. You can swap out the LLM and the system still works.
That's the whole point. The architecture around the model matters more than the model itself.
Rumi is open source: https://github.com/subhansh-dev/Rumi