I Wrapped an 8B Model in 40 Neuroscience Modules and It Started Outperforming GPT-4

Okay, hear me out.

I know that title sounds like clickbait. But I have benchmarks, code, and a story about hitting API rate limits at 3am.

I'm 17. I built FRIDAY, a 95K-line cognitive AI pipeline in Python. It takes Llama-3.1-8B-Instruct (8 billion parameters, free-tier Groq inference) and wraps it in so much cognitive scaffolding that it scores 88% on a 50-question ARC-Challenge sample. GPT-4 territory. On an 8B model. For free.

This is the "here's how the architecture actually works" deep dive. Not the fluff piece — I already wrote that one.


The Problem With Big Models

The current playbook: make the model bigger. GPT-4, Claude, Gemini — more parameters = more intelligence. It works. But from an engineering perspective, "throw 100x more compute at it" feels like giving up.

So I asked: what if the architecture around the model is what matters?

FRIDAY is my answer. A cognitive architecture — 40+ modules inspired by neuroscience and psychology — that forces a small model to think before it speaks. Not through prompts. Through computational structure.


The Pipeline: 8 Stages of Forced Reasoning

Every query follows this path:

reason → perceive → plan → simulate → execute → debug → reflect → consolidate

The Routing Decision: Fast or Slow?

Inspired by Kahneman's dual-process theory (System 1 vs System 2), FRIDAY decides whether a query needs fast intuition or deep deliberation:

FAST_PATH_CONFIDENCE = 0.75

def _fast_path(self, request, context, response):
    domain = context.get("domain", "general")
    success, result = self._call_module("intuition", "recognize", request, domain)

    if success and result:
        action, confidence, match_info = result
        if action and confidence >= FAST_PATH_CONFIDENCE:
            emo_success, emo_result = self._call_module("emotional", "affect_heuristic", action)
            if emo_success and emo_result:
                valence = emo_result.get("emotional_valence", 0.0)
                response.confidence = clamp(confidence + valence * 0.1)
            response.response = action
            response.path = "fast"
            return True
    return False

The Intuition Engine checks whether it has seen something like this before. If it has, and the match confidence is at least 0.75, FRIDAY takes the fast path: the remembered action comes back in under 100ms with no extra LLM calls. This is how FRIDAY handles "what time is it" without engaging 40 modules.

The Deliberative Pipeline

If fast path fails, we enter 8 cognitive stages:

Metacognitive Strategy → Emotional Priming → Module Competition →
Causal Reasoning → Analogical Reasoning → Creativity →
World Model Simulation → Neurosymbolic Verification

Each module contributes weighted evidence. The key is graceful degradation: if a module fails or times out (5s default), the pipeline keeps going without it. This is why FRIDAY had zero errors across 535 benchmark questions. Not because nothing failed, but because the system never crashes when a module does.
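Here's a minimal sketch of what that "never crash" wrapper could look like. It is not FRIDAY's actual `_call_module`; the `call_module` name, the shared pool, and its size are assumptions for illustration. The only things taken from the post are the 5s default and the `(success, result)` return shape.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

MODULE_TIMEOUT_S = 5.0  # the 5s default mentioned above

# Hypothetical shared worker pool. A timed-out call keeps running in the
# background, because Python threads cannot be forcibly cancelled.
_POOL = ThreadPoolExecutor(max_workers=8)

def call_module(func, *args, timeout=MODULE_TIMEOUT_S):
    """Run one module call; return (success, result) and never raise."""
    future = _POOL.submit(func, *args)
    try:
        return True, future.result(timeout=timeout)
    except FutureTimeout:
        return False, None  # module was too slow: skip it
    except Exception:
        return False, None  # module crashed: skip it
```

The caller only ever branches on the `success` flag, so a broken module costs you its evidence and nothing else.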


The Modules That Actually Matter

Intuition Engine: Kahneman Meets Klein

Implements two models simultaneously — Kahneman's System 1 (fast pattern recognition) and Gary Klein's Recognition-Primed Decision model (how experts decide under pressure).

Each pattern is a 12-dimensional feature vector. Not embeddings. Hand-crafted features:

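def _extract_features(self, text):
    words = text.lower().split()
    n = len(words)
    # Average word length; max(n, 1) guards against empty input
    avg_wl = sum(len(w) for w in words) / max(n, 1)
    f_len = min(1.0, n / 100.0)           # length, saturating at 100 words
    f_avg_wl = min(1.0, avg_wl / 15.0)    # word length, saturating at 15 chars
    f_uniq = len(set(words)) / max(n, 1)  # vocabulary diversity
    f_q = 1.0 if "?" in text else 0.0     # is it a question?
    # ... 8 more features
    return [f_len, f_avg_wl, f_uniq, f_q, ...]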

Pattern matching is cosine similarity. No LLM calls. Under 100ms.
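For anyone who hasn't written cosine similarity by hand, here's a rough sketch in plain Python. The `best_match` helper and the dict-of-vectors layout are my assumptions for the example, not how FRIDAY actually stores its patterns.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector matches nothing
    return dot / (norm_a * norm_b)

def best_match(query, patterns):
    """patterns: dict of name -> vector. Returns (name, similarity)."""
    best_name, best_sim = None, 0.0
    for name, vec in patterns.items():
        sim = cosine_similarity(query, vec)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name, best_sim
```

With 12-dimensional vectors this is a few dozen multiplications per stored pattern, which is why no model call is needed.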

Patterns decay using Ebbinghaus' forgetting curve:

DECAY_HALF_LIFE_DAYS = 60
decay_factor = 2 ** (-days_since_use / DECAY_HALF_LIFE_DAYS)
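To make the half-life concrete, here's the same formula wrapped in a function and evaluated at a few ages. The `decay_factor` function wrapper is just for the example.

```python
DECAY_HALF_LIFE_DAYS = 60

def decay_factor(days_since_use):
    """Weight of a pattern that was last used `days_since_use` days ago."""
    return 2 ** (-days_since_use / DECAY_HALF_LIFE_DAYS)

for days in (0, 60, 120):
    # Full weight when fresh, half after 60 days, a quarter after 120
    print(days, decay_factor(days))
```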

Active Inference: Friston's Free Energy Principle

The core idea: organisms minimize surprise. Predict what happens, update when wrong.

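import math

def compute_prediction_error(self, tool_name, prediction, actual_success, actual_duration_ms):
    # Outcome surprise: predicted success probability vs what actually happened
    success_error = abs(prediction["expected_success"] - (1.0 if actual_success else 0.0))
    # Default to zero so the return is safe when either duration is missing
    duration_error = 0.0
    if actual_duration_ms > 0 and prediction["expected_duration_ms"] > 0:
        ratio = actual_duration_ms / max(prediction["expected_duration_ms"], 1)
        # Slower than expected is scored on a log scale, faster on a linear one
        duration_error = math.log2(ratio) * 0.3 if ratio > 1 else abs(1 - ratio) * 0.3
    return min(success_error + duration_error, 2.0)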

High prediction errors trigger epistemic foraging — "I don't understand this well enough, explore more."
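One way to turn "high prediction errors" into a decision is a rolling average per tool. This is a sketch under my own assumptions: the `SurpriseTracker` class, the 0.5 threshold, and the 10-sample window are all hypothetical, not values from FRIDAY.

```python
from collections import defaultdict, deque

FORAGE_THRESHOLD = 0.5  # hypothetical: mean error above this triggers exploration
WINDOW = 10             # hypothetical: how many recent errors to remember

class SurpriseTracker:
    """Tracks recent prediction errors per tool and flags when to explore."""

    def __init__(self):
        self._errors = defaultdict(lambda: deque(maxlen=WINDOW))

    def record(self, tool_name, prediction_error):
        self._errors[tool_name].append(prediction_error)

    def should_explore(self, tool_name):
        errors = self._errors.get(tool_name)
        if not errors:
            return True  # never observed, so nothing is known yet
        return sum(errors) / len(errors) > FORAGE_THRESHOLD
```

Averaging over a window keeps one unlucky call from sending the system off exploring.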

Hierarchical Active Inference: 3 Levels of Belief

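Three levels: Meta (strategic), Subgoal (tactical), Action (execution). Each maintains a belief state that gets a precision-weighted, Bayesian-style update: observed hypotheses move toward their likelihood, and unobserved ones decay by 5%.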

def update(self, observation, learning_rate=0.1):
    effective_lr = learning_rate * self.precision
    for hyp, likelihood in observation.items():
        if hyp in self.hypotheses:
            prior = self.hypotheses[hyp]
            self.hypotheses[hyp] = prior + effective_lr * (likelihood - prior)
    for hyp in list(self.hypotheses.keys()):
        if hyp not in observation:
            self.hypotheses[hyp] *= 0.95

Bidirectional: top-down constraints propagate down, execution errors propagate up. Mirrors prefrontal-motor cortex interaction.
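Here's the update rule above dropped into a tiny class so you can watch one step happen. The `BeliefState` name, the constructor, and the hypothesis names are assumptions for the example; only the `update` body comes from the post. Note that the values are not renormalized, so they act as independent confidence scores and need not sum to 1.

```python
class BeliefState:
    """Hypothetical container for one level's hypotheses."""

    def __init__(self, hypotheses, precision=1.0):
        self.hypotheses = dict(hypotheses)
        self.precision = precision

    def update(self, observation, learning_rate=0.1):
        effective_lr = learning_rate * self.precision
        # Observed hypotheses move a fraction of the way toward their likelihood
        for hyp, likelihood in observation.items():
            if hyp in self.hypotheses:
                prior = self.hypotheses[hyp]
                self.hypotheses[hyp] = prior + effective_lr * (likelihood - prior)
        # Hypotheses with no evidence this round slowly fade
        for hyp in list(self.hypotheses.keys()):
            if hyp not in observation:
                self.hypotheses[hyp] *= 0.95

belief = BeliefState({"bug_in_parser": 0.5, "bug_in_network": 0.5})
belief.update({"bug_in_parser": 1.0})
# bug_in_parser rises from 0.5 toward 1.0; bug_in_network decays by 5%
print(belief.hypotheses)
```

A higher `precision` at a level means observations there move beliefs faster.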

Cognitive Appraisal: How Emotions Get Generated

Using Lazarus' theory — two evaluation levels:

  • Primary: Is this relevant? Good or bad?
  • Secondary: What can I do about it?

Maps to coping strategies (problem-focused, reappraisal, avoidance, etc.) that affect downstream reasoning — confidence, risk tolerance, exploration vs exploitation.

Metacognitive Monitor: Thinking About Thinking

  • Confidence calibration: tracks whether confidence matches reality (overconfidence threshold: 0.15)
  • Error pattern detection: scans last 200 errors, flags recurring patterns
  • Fatigue detection: notices >20% accuracy drop over 30 interactions
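The calibration check in the list above can be sketched in a few lines. I'm assuming here that the 0.15 threshold applies to the gap between average stated confidence and actual accuracy; the function names and the record format are made up for the example.

```python
OVERCONFIDENCE_THRESHOLD = 0.15  # from the list above

def calibration_gap(records):
    """records: list of (stated_confidence, was_correct) pairs.

    Returns mean confidence minus accuracy; positive means overconfident.
    """
    if not records:
        return 0.0
    mean_conf = sum(conf for conf, _ in records) / len(records)
    accuracy = sum(1 for _, ok in records if ok) / len(records)
    return mean_conf - accuracy

def is_overconfident(records):
    return calibration_gap(records) > OVERCONFIDENCE_THRESHOLD
```

So a system that says "90% sure" but is right half the time has a gap of 0.4 and gets flagged.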

Cognitive Load: Miller's 7±2

Implements Sweller's Cognitive Load Theory:

WORKING_MEMORY_SLOTS = 7
MODULE_COSTS = {
    "active_inference": 0.10, "dreaming": 0.15,
    "causal_reasoner": 0.15, "intuition_engine": 0.05,
    # ... 30+ modules
}

When load exceeds capacity, sheds lower-priority modules. Better slightly less thorough than crashing.
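A rough sketch of how shedding could work: sort by priority, then keep modules until the budget runs out. The costs are the four shown above; the priorities, the 0.35 capacity, and the `shed_load` function are assumptions for the example.

```python
MODULE_COSTS = {
    "active_inference": 0.10, "dreaming": 0.15,
    "causal_reasoner": 0.15, "intuition_engine": 0.05,
}
# Hypothetical priorities (higher = more important); not FRIDAY's real values
MODULE_PRIORITY = {
    "intuition_engine": 4, "causal_reasoner": 3,
    "active_inference": 2, "dreaming": 1,
}
LOAD_CAPACITY = 0.35  # hypothetical total budget

def shed_load(active, capacity=LOAD_CAPACITY):
    """Keep the highest-priority modules whose total cost fits in capacity."""
    kept, load = [], 0.0
    ranked = sorted(active, key=lambda m: MODULE_PRIORITY.get(m, 0), reverse=True)
    for name in ranked:
        cost = MODULE_COSTS.get(name, 0.0)
        if load + cost <= capacity:
            kept.append(name)
            load += cost
    return kept

# With these numbers, "dreaming" is the module that gets dropped
print(shed_load(list(MODULE_COSTS)))
```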

Memory Systems

  • Episodic: timestamped event log.
  • Associative: spreading activation network (Collins & Loftus, 1975). Recall one memory and connected ones activate.
  • Predictive: anticipates what you'll need.
  • Consolidation: sleep-like processing (McClelland et al., 1995) that compresses episodic memories into semantic knowledge every 6 hours.
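Spreading activation is easier to see in code than in prose. This is a minimal sketch assuming a weighted adjacency dict; the `spread_activation` function, the decay of 0.5, and the 0.1 cutoff are illustrative choices, not FRIDAY's parameters.

```python
def spread_activation(graph, source, decay=0.5, threshold=0.1):
    """graph: dict of node -> {neighbor: weight in [0, 1]}.

    Returns a dict of node -> activation for every node that was reached.
    """
    activation = {source: 1.0}
    frontier = [source]
    while frontier:
        node = frontier.pop()
        for neighbor, weight in graph.get(node, {}).items():
            incoming = activation[node] * weight * decay
            # Only propagate signals that are strong enough and that
            # improve on what the neighbor already has
            if incoming > threshold and incoming > activation.get(neighbor, 0.0):
                activation[neighbor] = incoming
                frontier.append(neighbor)
    return activation

memories = {
    "coffee": {"morning": 0.8, "mug": 0.6},
    "morning": {"alarm": 0.9},
    "alarm": {"snooze": 0.5},
}
# "snooze" is too many weak hops away, so it never activates
print(spread_activation(memories, "coffee"))
```

Because every hop multiplies the signal by at most the decay, cycles in the graph die out on their own.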

The Dreaming System

When idle for 2 minutes, FRIDAY dreams. Replays memories, extracts patterns, validates against reality. Inspired by hippocampal replay during sleep.

Self-Awareness Module

  • Introspection Engine: examines own reasoning
  • Self-Narrative: continuous identity across sessions
  • Theory of Mind: models user's mental state
  • Bias Detection: monitors for 12 cognitive biases (confirmation, anchoring, Dunning-Kruger, etc.)

Is it "real" self-awareness? It's functional self-monitoring that improves output quality. Whether that counts as consciousness is above my pay grade.

Causal Reasoner: Pearl's Three Levels

Full causal hierarchy — Association (P(Y|X)), Intervention (P(Y|do(X=x))), Counterfactual. Causal edges have strength, mechanism, confidence — and decay if not reinforced.

Neurosymbolic Reasoner

Combines neural (LLM) and symbolic (formal logic) reasoning. Propositional logic engine built from scratch — no SymPy, no Z3. Can verify code invariants and do formal verification.
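To give a feel for what "from scratch" means, here's a small truth-table entailment checker. It is a sketch of the general technique, not FRIDAY's engine; the tuple-based formula format and the function names are assumptions for the example.

```python
from itertools import product

def evaluate(formula, env):
    """formula: a variable name (str) or a tuple like ("and", a, b)."""
    if isinstance(formula, str):
        return env[formula]
    op, *args = formula
    if op == "not":
        return not evaluate(args[0], env)
    if op == "and":
        return evaluate(args[0], env) and evaluate(args[1], env)
    if op == "or":
        return evaluate(args[0], env) or evaluate(args[1], env)
    if op == "implies":
        return (not evaluate(args[0], env)) or evaluate(args[1], env)
    raise ValueError(f"unknown operator: {op}")

def variables(formula):
    """Collect every variable name that appears in a formula."""
    if isinstance(formula, str):
        return {formula}
    return set().union(*(variables(arg) for arg in formula[1:]))

def entails(premises, conclusion):
    """True if every assignment satisfying all premises satisfies the conclusion."""
    names = sorted(set().union(variables(conclusion), *(variables(p) for p in premises)))
    for values in product([False, True], repeat=len(names)):
        env = dict(zip(names, values))
        if all(evaluate(p, env) for p in premises) and not evaluate(conclusion, env):
            return False  # found a counterexample
    return True
```

Brute-force truth tables are exponential in the number of variables, which is fine for checking a handful of propositions pulled out of an LLM answer.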


The Benchmarks

All on Groq's Llama-3.1-8B-Instruct. Single-shot pass@1. No tricks.

Benchmark       Accuracy   Questions   Avg Time
ARC-Challenge   88.0%      50          46.2s
GSM8K           85.0%      100         26.5s
TruthfulQA      71.0%      100         37.2s
ARC-Easy        68.0%      50          30.6s
MMLU            61.0%      100         21.0s
GPQA            42.0%      50          60.0s

535 questions. Zero errors.

ARC-Challenge at 88% is the standout — genuine multi-step reasoning. TruthfulQA at 71% is interesting — the pipeline helps resist confident wrong answers. MMLU at 61% is nuanced: 100% on heavy conceptual subjects, below baseline on trivia (the over-thinking penalty).


Key Design Decisions

  1. Graceful degradation everywhere — every import is try/except'd
  2. Thread-safe JSON persistence — state survives crashes
  3. No heavy dependencies — logic engine from scratch, features hand-crafted
  4. Prediction-error learning — self-improving feedback loop
  5. Module competition — multiple proposals, best wins
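For decision 2, here's one common way to get crash-safe JSON state: take a lock, write to a temp file, then atomically swap it in. This is a sketch of the pattern, not FRIDAY's code; the `JsonStore` class and its method names are hypothetical.

```python
import json
import os
import tempfile
import threading

class JsonStore:
    """Hypothetical helper: lock-protected, atomically written JSON state."""

    def __init__(self, path):
        self._path = path
        self._lock = threading.Lock()

    def save(self, state):
        with self._lock:
            directory = os.path.dirname(os.path.abspath(self._path))
            fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
            try:
                with os.fdopen(fd, "w", encoding="utf-8") as handle:
                    json.dump(state, handle)
                # os.replace swaps the file in one step, so a crash mid-write
                # leaves the previous state file untouched
                os.replace(tmp_path, self._path)
            except BaseException:
                if os.path.exists(tmp_path):
                    os.remove(tmp_path)
                raise

    def load(self, default=None):
        with self._lock:
            try:
                with open(self._path, "r", encoding="utf-8") as handle:
                    return json.load(handle)
            except (FileNotFoundError, json.JSONDecodeError):
                return default  # missing or corrupt file: degrade gracefully
```

The lock covers threads inside one process; multiple processes sharing a file would need something more.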

What's Next

  • Routing layer to avoid the MMLU over-thinking penalty
  • Scaling to Llama-3.1-70B
  • More benchmarks: HellaSwag, WinoGrande, HumanEval
  • 200+ samples per benchmark
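Why the last item matters: with 50 questions, the uncertainty on a score is wide. Here's a quick sketch using the normal approximation to a binomial proportion (an assumption; exact intervals differ slightly at small sample sizes). The `accuracy_interval` function is just for the example.

```python
import math

def accuracy_interval(accuracy, n, z=1.96):
    """Approximate 95% confidence interval for an accuracy measured on n questions."""
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - z * standard_error, accuracy + z * standard_error

# 88% on 50 questions spans roughly 0.79 to 0.97
print(accuracy_interval(0.88, 50))
# The same score on 200 questions would narrow to roughly 0.83 to 0.93
print(accuracy_interval(0.88, 200))
```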

I'm 17 and building cognitive AI systems. If you're interested in the architecture, benchmarks, or just want to argue about whether any of this constitutes "real" reasoning, reach out through the repository below.

Source: github.com/subhansh-dev/Friday-Autonomous-Cognitive-AI-Operating-System
