Okay, hear me out.
I know that title sounds like clickbait. But I have benchmarks, code, and a story about hitting API rate limits at 3am.
I'm 17. I built FRIDAY — a 95K-line cognitive AI pipeline in Python. It takes Llama-3.1-8B-Instruct (8 billion parameters, free-tier Groq inference) and wraps it in so much cognitive scaffolding that it scores 88% on ARC-Challenge. GPT-4 territory. On an 8B model. For free.
This is the "here's how the architecture actually works" deep dive. Not the fluff piece — I already wrote that one.
The Problem With Big Models
The current playbook: make the model bigger. GPT-4, Claude, Gemini — more parameters = more intelligence. It works. But from an engineering perspective, "throw 100x more compute at it" feels like giving up.
So I asked: what if the architecture around the model is what matters?
FRIDAY is my answer. A cognitive architecture — 40+ modules inspired by neuroscience and psychology — that forces a small model to think before it speaks. Not through prompts. Through computational structure.
The Pipeline: 8 Stages of Forced Reasoning
Every query follows this path:
reason → perceive → plan → simulate → execute → debug → reflect → consolidate
The Routing Decision: Fast or Slow?
Inspired by Kahneman's dual-process theory (System 1 vs System 2), FRIDAY decides whether a query needs fast intuition or deep deliberation:
FAST_PATH_CONFIDENCE = 0.75

def _fast_path(self, request, context, response):
    domain = context.get("domain", "general")
    # Ask the Intuition Engine whether it recognizes this request
    success, result = self._call_module("intuition", "recognize", request, domain)
    if success and result:
        action, confidence, match_info = result
        if action and confidence >= FAST_PATH_CONFIDENCE:
            # Let emotional valence nudge the confidence by at most +/-0.1
            emo_success, emo_result = self._call_module("emotional", "affect_heuristic", action)
            if emo_success and emo_result:
                valence = emo_result.get("emotional_valence", 0.0)
                response.confidence = clamp(confidence + valence * 0.1)
            response.response = action
            response.path = "fast"
            return True
    # No confident match: fall through to the deliberative pipeline
    return False
The Intuition Engine checks whether it has seen something like this before. If it has, and its confidence is at least 0.75, the query takes the fast path: a match in under 100ms with no extra LLM calls. This is how FRIDAY handles "what time is it" without engaging 40 modules.
The Deliberative Pipeline
If fast path fails, we enter 8 cognitive stages:
Metacognitive Strategy → Emotional Priming → Module Competition →
Causal Reasoning → Analogical Reasoning → Creativity →
World Model Simulation → Neurosymbolic Verification
Each module contributes weighted evidence. The key: graceful degradation. If a module fails or times out (5s default), the pipeline keeps going. This is why FRIDAY had zero errors across 535 benchmark questions: not because no module ever failed, but because the pipeline keeps running when one does.
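Here is a minimal sketch of what a fail-safe module call with a timeout could look like. The helper name `call_module_safely` and the thread-pool approach are assumptions for illustration, not FRIDAY's actual `_call_module`; only the 5s default and the `(success, result)` return shape come from the text and code above.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

MODULE_TIMEOUT_S = 5.0  # the 5s default mentioned above


def call_module_safely(fn, *args, timeout=MODULE_TIMEOUT_S):
    """Run fn(*args) and return (success, result) instead of raising.

    Hypothetical helper: a timeout or any exception becomes (False, None),
    so the caller can simply skip this module's evidence.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(fn, *args)
        return True, future.result(timeout=timeout)
    except FutureTimeout:
        return False, None  # module was too slow
    except Exception:
        return False, None  # module raised
    finally:
        # Do not block on a stuck worker; it finishes in the background
        pool.shutdown(wait=False)


def slow_module():
    """Stand-in for a module that overruns its time budget."""
    time.sleep(0.3)
    return "too late"


print(call_module_safely(lambda x: x + 1, 1))          # healthy module
print(call_module_safely(lambda: 1 / 0))               # crashing module
print(call_module_safely(slow_module, timeout=0.05))   # slow module
```

One caveat with this approach: a thread cannot be killed, so a timed-out module keeps running in the background until it returns. The pipeline just stops waiting for it.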
The Modules That Actually Matter
Intuition Engine: Kahneman Meets Klein
Implements two models simultaneously — Kahneman's System 1 (fast pattern recognition) and Gary Klein's Recognition-Primed Decision model (how experts decide under pressure).
Each pattern is a 12-dimensional feature vector. Not embeddings. Hand-crafted features:
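def _extract_features(self, text):
    words = text.lower().split()
    n = len(words)
    # avg_wl was elided in this excerpt; assumed here to be the mean word length
    avg_wl = sum(len(w) for w in words) / max(n, 1)
    f_len = min(1.0, n / 100.0)           # length, saturating at 100 words
    f_avg_wl = min(1.0, avg_wl / 15.0)    # mean word length, saturating at 15 chars
    f_uniq = len(set(words)) / max(n, 1)  # share of unique words
    f_q = 1.0 if "?" in text else 0.0     # question flag
    # ... 8 more features
    return [f_len, f_avg_wl, f_uniq, f_q, ...]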
Pattern matching is cosine similarity. No LLM calls. Under 100ms.
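For readers who want to see the matching step, here is a small pure-Python sketch of cosine similarity over feature vectors. The function names and the `patterns` dictionary are hypothetical. The post only states that matching uses cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query, patterns):
    """Return (name, score) of the stored pattern closest to the query vector.

    `patterns` is an assumed structure: a dict mapping pattern name to vector.
    """
    if not patterns:
        return None, 0.0
    scores = {name: cosine_similarity(query, vec) for name, vec in patterns.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


patterns = {"greeting": [0.9, 0.1], "math": [0.0, 1.0]}
print(best_match([1.0, 0.0], patterns))
```

The score returned here would play the role of the `confidence` value that the fast path compares against `FAST_PATH_CONFIDENCE`.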
Patterns decay using Ebbinghaus' forgetting curve:
DECAY_HALF_LIFE_DAYS = 60
decay_factor = 2 ** (-days_since_use / DECAY_HALF_LIFE_DAYS)
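A quick worked example of the half-life formula above. The `decayed_strength` wrapper and the idea of multiplying a stored strength by the factor are assumptions. The formula itself is the one from the snippet. It is the same exponential shape as the textbook form R = exp(-t/S), with S = half-life / ln 2.

```python
DECAY_HALF_LIFE_DAYS = 60


def decayed_strength(strength, days_since_use, half_life=DECAY_HALF_LIFE_DAYS):
    """Scale a pattern's strength by 2 ** (-days / half_life)."""
    return strength * 2 ** (-days_since_use / half_life)


for days in (0, 30, 60, 120):
    print(days, round(decayed_strength(1.0, days), 3))
```

A pattern unused for one half-life (60 days) keeps half its strength, and one unused for two half-lives keeps a quarter.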
Active Inference: Friston's Free Energy Principle
The core idea: organisms minimize surprise. Predict what happens, update when wrong.
import math

def compute_prediction_error(self, tool_name, prediction, actual_success, actual_duration_ms):
    success_error = abs(prediction["expected_success"] - (1.0 if actual_success else 0.0))
    duration_error = 0.0  # default when either duration is missing
    if actual_duration_ms > 0 and prediction["expected_duration_ms"] > 0:
        ratio = actual_duration_ms / max(prediction["expected_duration_ms"], 1)
        # Slower than expected: log-scaled penalty. Faster: linear penalty.
        duration_error = math.log2(ratio) * 0.3 if ratio > 1 else abs(1 - ratio) * 0.3
    return min(success_error + duration_error, 2.0)
High prediction errors trigger epistemic foraging — "I don't understand this well enough, explore more."
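One way the "explore more" trigger could work is a rolling average of recent prediction errors compared against a threshold. This is a sketch under assumptions: the `SurpriseMonitor` class, the 0.8 threshold, and the window size are all made up for illustration and are not taken from FRIDAY's code.

```python
from collections import deque


class SurpriseMonitor:
    """Hypothetical tracker that flags when recent prediction errors run high."""

    def __init__(self, threshold=0.8, window=5):
        self.threshold = threshold          # assumed value, errors are capped at 2.0
        self.errors = deque(maxlen=window)  # only the most recent errors count

    def record(self, prediction_error):
        self.errors.append(prediction_error)

    def should_explore(self):
        """True when the mean of the recent errors exceeds the threshold."""
        if not self.errors:
            return False
        return sum(self.errors) / len(self.errors) > self.threshold


monitor = SurpriseMonitor(threshold=0.8, window=3)
for err in (0.1, 0.2):
    monitor.record(err)
print(monitor.should_explore())  # predictions are holding up
for err in (2.0, 2.0, 2.0):
    monitor.record(err)
print(monitor.should_explore())  # repeated surprises
```

Averaging over a window rather than reacting to a single error keeps one slow tool call from sending the whole system into exploration mode.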
Hierarchical Active Inference: 3 Levels of Belief
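Three levels: Meta (strategic), Subgoal (tactical), Action (execution). Each maintains a belief state that is nudged toward new evidence by a precision-weighted learning rate, an incremental approximation of a Bayesian update: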
def update(self, observation, learning_rate=0.1):
    effective_lr = learning_rate * self.precision
    # Move each observed hypothesis a fraction of the way toward its likelihood
    for hyp, likelihood in observation.items():
        if hyp in self.hypotheses:
            prior = self.hypotheses[hyp]
            self.hypotheses[hyp] = prior + effective_lr * (likelihood - prior)
    # Hypotheses with no new evidence decay by 5%
    for hyp in list(self.hypotheses.keys()):
        if hyp not in observation:
            self.hypotheses[hyp] *= 0.95
Bidirectional: top-down constraints propagate down, execution errors propagate up. Mirrors prefrontal-motor cortex interaction.
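To make the update rule concrete, here is a runnable worked example. The `BeliefState` class, its constructor, and the hypothesis names are hypothetical scaffolding. Only the body of `update` mirrors the snippet above.

```python
class BeliefState:
    """Hypothetical container; only update() follows the snippet above."""

    def __init__(self, hypotheses, precision=1.0):
        self.hypotheses = dict(hypotheses)
        self.precision = precision

    def update(self, observation, learning_rate=0.1):
        effective_lr = learning_rate * self.precision
        for hyp, likelihood in observation.items():
            if hyp in self.hypotheses:
                prior = self.hypotheses[hyp]
                self.hypotheses[hyp] = prior + effective_lr * (likelihood - prior)
        for hyp in list(self.hypotheses.keys()):
            if hyp not in observation:
                self.hypotheses[hyp] *= 0.95


belief = BeliefState({"retry": 0.5, "abort": 0.5})
belief.update({"retry": 0.9})
# "retry" moves 10% of the way from 0.5 toward 0.9; "abort" decays by 5%
print(belief.hypotheses)

confident = BeliefState({"retry": 0.5}, precision=2.0)
confident.update({"retry": 0.9})
# doubling precision doubles the step size
print(confident.hypotheses)
```

Nothing in this rule renormalizes the values to sum to 1, so they behave more like independent confidence scores than a probability distribution.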
Cognitive Appraisal: How Emotions Get Generated
Using Lazarus' theory — two evaluation levels:
- Primary: Is this relevant? Good or bad?
- Secondary: What can I do about it?
Maps to coping strategies (problem-focused, reappraisal, avoidance, etc.) that affect downstream reasoning — confidence, risk tolerance, exploration vs exploitation.
- Confidence calibration: tracks whether confidence matches reality (overconfidence threshold: 0.15)
- Error pattern detection: scans last 200 errors, flags recurring patterns
- Fatigue detection: notices >20% accuracy drop over 30 interactions
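The confidence calibration check in the list above can be sketched as a comparison between average stated confidence and actual accuracy. Reading the 0.15 threshold as a gap between those two numbers is an assumption, and the function names and record format are hypothetical.

```python
OVERCONFIDENCE_THRESHOLD = 0.15  # from the list above; its exact meaning is assumed


def overconfidence_gap(records):
    """Mean stated confidence minus observed accuracy.

    `records` is an assumed format: a list of (confidence, was_correct) pairs.
    A positive gap means the system claims more than it delivers.
    """
    if not records:
        return 0.0
    mean_confidence = sum(conf for conf, _ in records) / len(records)
    accuracy = sum(1 for _, ok in records if ok) / len(records)
    return mean_confidence - accuracy


def is_overconfident(records):
    return overconfidence_gap(records) > OVERCONFIDENCE_THRESHOLD


history = [(0.9, True), (0.9, False), (0.9, False), (0.9, True)]
print(round(overconfidence_gap(history), 2), is_overconfident(history))
```

In this example the system says 0.9 on average but is right half the time, so the gap of 0.4 is well past the threshold.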
Cognitive Load: Miller's 7±2
Implements Sweller's Cognitive Load Theory:
WORKING_MEMORY_SLOTS = 7
MODULE_COSTS = {
    "active_inference": 0.10, "dreaming": 0.15,
    "causal_reasoner": 0.15, "intuition_engine": 0.05,
    # ... 30+ modules
}
When load exceeds capacity, FRIDAY sheds lower-priority modules. Better to be slightly less thorough than to crash.
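A minimal sketch of load shedding, assuming a greedy rule: admit modules in priority order until the cost budget is used up. The `shed_modules` function, the priority numbers, and the 0.30 capacity are all hypothetical. Only the four cost values come from the snippet above.

```python
MODULE_COSTS = {
    "active_inference": 0.10, "dreaming": 0.15,
    "causal_reasoner": 0.15, "intuition_engine": 0.05,
}


def shed_modules(requested, priorities, capacity=0.30):
    """Keep the highest-priority modules whose combined cost fits the capacity.

    `priorities` (higher = more important) and `capacity` are assumed inputs.
    """
    kept, load = [], 0.0
    for name in sorted(requested, key=lambda m: priorities[m], reverse=True):
        cost = MODULE_COSTS[name]
        # Small tolerance so float rounding does not reject an exact fit
        if load + cost <= capacity + 1e-9:
            kept.append(name)
            load += cost
    return kept


priorities = {"intuition_engine": 4, "causal_reasoner": 3,
              "active_inference": 2, "dreaming": 1}
print(shed_modules(list(MODULE_COSTS), priorities, capacity=0.30))
```

With a budget of 0.30, the three most important modules fit exactly and `dreaming` is the one that gets dropped.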
Memory Systems
- Episodic: timestamped event log.
- Associative: spreading activation network (Collins & Loftus, 1975). Recall one memory and the connected ones activate.
- Predictive: anticipates what you'll need.
- Consolidation: sleep-like processing (McClelland et al., 1995) that compresses episodic memories into semantic knowledge every 6 hours.
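Spreading activation is easy to show in a few lines. This sketch assumes a weighted graph stored as nested dicts, a per-hop decay of 0.5, and a cutoff of 0.1. All of those, plus the function name and the toy memories, are illustrative choices rather than FRIDAY's actual parameters.

```python
def spread_activation(graph, source, decay=0.5, threshold=0.1):
    """Propagate activation outward from `source`.

    `graph` maps node -> {neighbor: weight}, with weights assumed in [0, 1].
    Each hop multiplies activation by weight * decay; activation below
    `threshold` is dropped, which also guarantees termination.
    """
    activation = {source: 1.0}
    frontier = [source]
    while frontier:
        node = frontier.pop()
        for neighbor, weight in graph.get(node, {}).items():
            candidate = activation[node] * weight * decay
            if candidate >= threshold and candidate > activation.get(neighbor, 0.0):
                activation[neighbor] = candidate
                frontier.append(neighbor)
    return activation


memories = {
    "coffee": {"morning": 1.0, "mug": 0.4},
    "morning": {"alarm": 0.8},
}
print(spread_activation(memories, "coffee"))
```

Recalling "coffee" activates "morning" strongly, and "alarm" still lights up a little even though it is two hops away.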
The Dreaming System
When idle for 2 minutes, FRIDAY dreams. Replays memories, extracts patterns, validates against reality. Inspired by hippocampal replay during sleep.
Self-Awareness Module
- Introspection Engine: examines own reasoning
- Self-Narrative: continuous identity across sessions
- Theory of Mind: models user's mental state
- Bias Detection: monitors for 12 cognitive biases (confirmation, anchoring, Dunning-Kruger, etc.)
Is it "real" self-awareness? It's functional self-monitoring that improves output quality. Whether that counts as consciousness is above my pay grade.
Causal Reasoner: Pearl's Three Levels
Full causal hierarchy — Association (P(Y|X)), Intervention (P(Y|do(X=x))), Counterfactual. Causal edges have strength, mechanism, confidence — and decay if not reinforced.
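The difference between the first two levels shows up clearly in a toy model with a confounder. The sketch below is a standalone illustration with made-up probabilities, not FRIDAY's causal reasoner: Z influences both X and Y, while X has no effect on Y at all. The intervention is computed with the standard backdoor adjustment over Z.

```python
# Toy model (all numbers invented): Z -> X and Z -> Y, no X -> Y edge
P_Z = {0: 0.5, 1: 0.5}
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (1, 0): 0.1, (0, 1): 0.9, (1, 1): 0.9}


def p_x_given_z(x, z):
    return P_X1_GIVEN_Z[z] if x == 1 else 1.0 - P_X1_GIVEN_Z[z]


def p_y_given_x(x):
    """Association: P(Y=1 | X=x), obtained by conditioning on observed X."""
    joint = sum(P_Z[z] * p_x_given_z(x, z) * P_Y1_GIVEN_XZ[(x, z)] for z in P_Z)
    marginal = sum(P_Z[z] * p_x_given_z(x, z) for z in P_Z)
    return joint / marginal


def p_y_do_x(x):
    """Intervention: P(Y=1 | do(X=x)), via backdoor adjustment over Z."""
    return sum(P_Z[z] * P_Y1_GIVEN_XZ[(x, z)] for z in P_Z)


print(round(p_y_given_x(1), 2))  # observing X=1 makes Y look likely
print(round(p_y_do_x(1), 2))     # forcing X=1 changes nothing about Y
```

Observing X=1 raises the probability of Y to 0.74 because X=1 hints that Z=1. Forcing X=1 leaves it at the base rate of 0.5, since X never caused Y in the first place.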
Neurosymbolic Reasoner
Combines neural (LLM) and symbolic (formal logic) reasoning. Propositional logic engine built from scratch — no SymPy, no Z3. Can verify code invariants and do formal verification.
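To give a feel for what a dependency-free propositional checker involves, here is a tiny truth-table entailment test. It is a sketch of the general technique, not FRIDAY's engine: the `entails` function and the choice to represent formulas as Python callables are assumptions made to keep the example short.

```python
from itertools import product


def entails(premises, conclusion, variables):
    """True if every truth assignment satisfying all premises satisfies the conclusion.

    Formulas are assumed to be callables taking a dict {variable: bool}.
    Brute force over 2 ** len(variables) assignments, so only for small inputs.
    """
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # found a counterexample
    return True


def p_implies_q(env):
    return (not env["p"]) or env["q"]


# Modus ponens: from (p -> q) and p, q follows
print(entails([p_implies_q, lambda e: e["p"]], lambda e: e["q"], ["p", "q"]))

# Affirming the consequent: from (p -> q) and q, p does NOT follow
print(entails([p_implies_q, lambda e: e["q"]], lambda e: e["p"], ["p", "q"]))
```

A check like this can catch an LLM answer whose conclusion does not actually follow from its stated premises, as long as the argument can be translated into propositional form.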
The Benchmarks
All on Groq's Llama-3.1-8B-Instruct. Single-shot pass@1. No tricks.
| Benchmark | Accuracy | Questions | Avg Time |
|---|---|---|---|
| ARC-Challenge | 88.0% | 50 | 46.2s |
| GSM8K | 85.0% | 100 | 26.5s |
| TruthfulQA | 71.0% | 100 | 37.2s |
| ARC-Easy | 68.0% | 50 | 30.6s |
| MMLU | 61.0% | 100 | 21.0s |
| GPQA | 42.0% | 50 | 60.0s |
535 questions. Zero errors.
ARC-Challenge at 88% is the standout — genuine multi-step reasoning. TruthfulQA at 71% is interesting — the pipeline helps resist confident wrong answers. MMLU at 61% is nuanced: 100% on heavy conceptual subjects, below baseline on trivia (the over-thinking penalty).
Key Design Decisions
- Graceful degradation everywhere — every import is try/except'd
- Thread-safe JSON persistence — state survives crashes
- No heavy dependencies — logic engine from scratch, features hand-crafted
- Prediction-error learning — self-improving feedback loop
- Module competition — multiple proposals, best wins
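For the persistence point in the list above, here is one common pattern for crash-safe JSON state: serialize writers with a lock, write to a temporary file, then atomically swap it into place. This is a sketch of that general pattern with hypothetical function names. The post does not say how FRIDAY implements it.

```python
import json
import os
import tempfile
import threading

_state_lock = threading.Lock()


def save_state(path, state):
    """Write `state` as JSON so readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    with _state_lock:  # one writer at a time within this process
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as f:
                json.dump(state, f)
                f.flush()
                os.fsync(f.fileno())  # push bytes to disk before the swap
            os.replace(tmp_path, path)  # atomic rename over the old file
        except BaseException:
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
            raise


def load_state(path, default=None):
    """Read JSON state, falling back to `default` if missing or corrupt."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return default


demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "state.json")
print(load_state(demo_path, default={}))
save_state(demo_path, {"patterns": 3})
print(load_state(demo_path))
```

If the process dies mid-write, the old file is still intact and only a stray `.tmp` file is left behind.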
What's Next
- Routing layer to avoid the MMLU over-thinking penalty
- Scaling to Llama-3.1-70B
- More benchmarks: HellaSwag, WinoGrande, HumanEval
- 200+ samples per benchmark
I'm 17 and building cognitive AI systems. If you're interested in the architecture, benchmarks, or just want to argue about whether any of this constitutes "real" reasoning, reach me through the repo below.
Source: github.com/subhansh-dev/Friday-Autonomous-Cognitive-AI-Operating-System