When something is not working in an AI system, the first instinct is always the same add more to it. More layers. More features. More complexity. It feels like progress because you are actively doing something.
I fell into this exact trap while building FIE, an LLM failure detection system.
The 434-Feature Problem
The core of FIE's failure classifier is an XGBoost model. Its job: take a response from an LLM and decide whether it is a hallucination, an adversarial output, or a legitimate answer.
To make it "smarter," I built a 434-dimensional feature vector. Entropy scores. Consistency metrics across rephrased queries. Pairwise disagreement between three shadow models. Semantic distance between prompt and response. Every angle I could measure, I measured.
The model trained fine. Metrics looked solid. I felt good about it.
Then I ran SHAP analysis to understand what the model was actually using.
What SHAP Said
Out of 434 features, two dominated everything:
- Agreement score — do multiple models give the same answer?
- Jury verdict — does the diagnostic agent flag it as a hallucination?
These two features alone carried more predictive weight than the remaining 432 combined.
The other features were not useless — they contributed at the margins. But the signal that actually drove decisions came down to two questions: do the models agree, and does it look like a hallucination?
Everything else was noise dressed up as signal.
The Slim Model
I rebuilt the classifier using only the top 10 features from the SHAP analysis. Stripped out 424 features entirely.
The AUC was identical to the full 434-feature model.
Same performance. A fraction of the complexity. Faster inference, simpler debugging, easier to maintain.
The 434-feature version had felt thorough. The 10-feature version was thorough — because it kept only what actually mattered.
The Real Lesson
Complexity creates an illusion of capability. When a system has hundreds of features and dozens of layers, it looks like it should be good. That confidence is dangerous because it stops you from asking what the system is actually learning.
The features that generalise are almost never the complex ones. They are the ones closest to the core question you are trying to answer. In my case: do independent sources agree, and does this output look like a failure? Two signals. Direct. Interpretable. Sufficient.
Before adding the next layer, run the analysis first. Find out what your model is actually using. The answer is almost always simpler than what you built.