Most teams use one signal to judge deploy risk: how big is this PR?
It's intuitive. A 50-line change feels safer than a 2,000-line change. But it's also wrong — or at least deeply incomplete. Some of the highest-risk PRs ever merged into production were under 100 lines. React Hooks shipped in a PR that touched 0 lines of application logic. The TypeScript module migration that broke nearly every TS project worldwide was a deceptively "small" config change.
Size is a signal. It's just not the most important one.
Over the past year building Koalr — a deployment intelligence platform — I've gone deep into the research on what actually predicts deployment failures. The field is called Just-In-Time (JIT) Software Defect Prediction, and the academic literature is 15+ years deep. Here's what I found, and what we built.
The Research Foundation
The canonical study is Kamei et al. (IEEE TSE 2013): a large-scale empirical analysis of JIT defect prediction across six open-source projects. The core finding: defects are predictable at commit time with surprisingly high accuracy if you use the right features.
Microsoft Research extended this across Windows Vista, Eclipse, and Firefox, adding file-level ownership data and discovering that code ownership was one of the strongest predictors of post-release failures — not just who wrote the code, but how many people understood it.
The most recent work (PMC 2023) introduced developer graph topology (network centrality features) and reported an F1 score 152% higher than traditional signal sets alone.
We built our model on this foundation.
Hard Gates First
Before any weighted score, some conditions are binary. No amount of "mostly good signals" overrides them:
- DDL migration detected — schema changes that acquire exclusive table locks are the single most common root cause of P0 outages. Mandatory DBA review, no exceptions.
- Active incident in the last 4 hours — deploying into a degraded service compounds the blast radius.
- Error budget at 0% — if you've burned your SLO error budget, you're already in borrowed time.
- CVSS 9.0+ CVE in a newly introduced dependency — block by default.
Hard gates exist because weighted scores can be gamed or miscalibrated. Some conditions just aren't negotiable.
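Here's a minimal sketch of what the gate check can look like in front of the weighted score. The gate names and thresholds mirror the list above; the `PullRequestContext` shape and its field names are illustrative, not Koalr's actual schema.

```python
from dataclasses import dataclass
from typing import List

# Illustrative context object; field names are assumptions, not a real schema.
@dataclass
class PullRequestContext:
    has_ddl_migration: bool          # schema change that can take exclusive table locks
    hours_since_last_incident: float
    error_budget_remaining: float    # fraction of SLO error budget left, 0.0..1.0
    max_new_dependency_cvss: float   # highest CVSS among newly introduced dependencies

def hard_gate_reasons(pr: PullRequestContext) -> List[str]:
    """Return the non-negotiable blockers. Any entry blocks the deploy,
    no matter how good the weighted score looks."""
    reasons = []
    if pr.has_ddl_migration:
        reasons.append("DDL migration detected: mandatory DBA review")
    if pr.hours_since_last_incident < 4:
        reasons.append("Active/recent incident (< 4h): do not expand blast radius")
    if pr.error_budget_remaining <= 0.0:
        reasons.append("SLO error budget exhausted")
    if pr.max_new_dependency_cvss >= 9.0:
        reasons.append("New dependency carries a CVSS 9.0+ CVE")
    return reasons

# Only fall through to the weighted model when no gate fires:
# blockers = hard_gate_reasons(pr)
# if blockers: block_deploy(blockers)
# else:        score = weighted_risk(pr)
```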
The Surprising Top Signals
1. Change Entropy (weight: 0.11 — highest in our model)
Not change size, but change diffusion. We compute Shannon entropy over how the changed lines are distributed across subsystems: the more diffused the change, the higher the score.
A 500-line change concentrated in one file is much lower risk than a 200-line change spread across 15 modules in 4 subsystems. The Windows Vista research found entropy-based features achieved 90%+ precision/recall, significantly outperforming complexity metrics alone.
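Here's a sketch of the entropy calculation, assuming changed lines have already been bucketed by subsystem (how you bucket them depends on your repo layout):

```python
import math
from typing import Dict

def change_entropy(changed_lines_by_subsystem: Dict[str, int]) -> float:
    """Normalized Shannon entropy of the change's spread across subsystems.
    0.0 = all churn in one place, 1.0 = churn spread evenly everywhere."""
    total = sum(changed_lines_by_subsystem.values())
    counts = [c for c in changed_lines_by_subsystem.values() if c > 0]
    if total == 0 or len(counts) <= 1:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts)
    # Normalize by the maximum possible entropy for this many buckets.
    return entropy / math.log2(len(counts))

# The example from the paragraph above:
print(change_entropy({"billing": 500}))                                   # 0.0
print(change_entropy({"billing": 80, "auth": 40, "api": 50, "ui": 30}))   # ~0.95
```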
2. Author Expertise in Specifically Changed Files (weight: 0.10)
Not "is this developer senior?" but "has this developer specifically worked on these files before?" Microsoft found that files touched by developers with low file-specific expertise have substantially higher post-release failure rates — holding across three separate codebases.
We compute this via per-file git blame over a 12-month rolling window. A senior engineer touching an unfamiliar subsystem is a different risk profile than a junior engineer in their own module.
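A rough sketch of a blame-based expertise signal. It parses `git blame --line-porcelain` and counts how many surviving lines the PR author wrote inside the window; the function name and the exact windowing rule are illustrative, not the production implementation.

```python
import subprocess
import time
from collections import defaultdict

TWELVE_MONTHS = 365 * 24 * 3600

def blame_share(repo: str, path: str, author: str) -> float:
    """Fraction of a file's surviving lines written by `author` in the last
    12 months, derived from `git blame --line-porcelain`."""
    now = time.time()
    out = subprocess.run(
        ["git", "-C", repo, "blame", "--line-porcelain", path],
        capture_output=True, text=True, check=True,
    ).stdout

    lines_by_author: dict[str, int] = defaultdict(int)
    current_author = None
    for line in out.splitlines():
        if line.startswith("author "):
            current_author = line[len("author "):]
        elif line.startswith("author-time "):
            # Only count lines whose commit falls inside the rolling window.
            if current_author and now - int(line.split()[1]) <= TWELVE_MONTHS:
                lines_by_author[current_author] += 1

    total = sum(lines_by_author.values())
    return lines_by_author.get(author, 0) / total if total else 0.0

# expertise = mean(blame_share(repo, f, pr_author) for f in changed_files)
```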
3. Minor Contributor Count (weight: 0.09)
The number of developers with less than 5% of commits to a changed file. More minor contributors = "nobody fully owns this code." High minor contributor density correlates strongly with defect rates — not because the contributors are bad, but because fragmented ownership means fragmented understanding.
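This one is the simplest to compute. A sketch, assuming you feed it one author entry per commit to the file (e.g. from `git log --follow --format=%ae -- <path>`):

```python
from collections import Counter
from typing import Iterable

def minor_contributor_count(commit_authors: Iterable[str], threshold: float = 0.05) -> int:
    """Number of distinct authors responsible for less than `threshold` of a
    file's commits. High counts suggest fragmented ownership."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    if total == 0:
        return 0
    return sum(1 for n in counts.values() if n / total < threshold)

# authors = one entry per commit touching the file, oldest to newest
# minor = minor_contributor_count(authors)
```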
Why Interaction Effects Matter More Than Individual Signals
Here's where most implementations fail: they treat signals independently.
A high-churn PR from a senior author with full test coverage on the changed lines is low risk. The same churn from an author new to that module, with no reviewer overlap from people familiar with those files, with 0% test coverage on the diff — that's critical.
The signals only tell you something meaningful in combination. This is why we weight and combine them rather than threshold on any single value, and why customer-specific calibration on their actual incident history matters: the weights that work for a FinTech with SOC 2 requirements are different from the weights for a fast-moving startup.
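To make that concrete, here is a sketch of what "weight and combine" can look like, with one explicit interaction term for the churn × unfamiliar-author × untested-diff case described above. The weights and the interaction coefficient are illustrative; in practice you would calibrate them (for example, with logistic regression) against your own incident history.

```python
def combined_risk(signals: dict[str, float]) -> float:
    """Weighted combination of normalized signals (each 0..1, higher = riskier)
    plus one explicit interaction term. Weights are illustrative placeholders."""
    weights = {
        "entropy": 0.11,               # change diffusion
        "author_unfamiliarity": 0.10,  # 1 - file-specific expertise
        "minor_contributors": 0.09,    # fragmented ownership
        "churn": 0.06,
        "untested_diff": 0.08,         # 1 - coverage on changed lines
    }
    base = sum(w * signals.get(name, 0.0) for name, w in weights.items())

    # Interaction: churn is only critical when the author is unfamiliar with
    # the files AND the changed lines are untested. Any one factor alone is
    # survivable; the combination is what predicts failure.
    interaction = (
        signals.get("churn", 0.0)
        * signals.get("author_unfamiliarity", 0.0)
        * signals.get("untested_diff", 0.0)
    )
    return min(1.0, base + 0.25 * interaction)
```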
The Semantic Layer
Quantitative signals can't catch everything. PR descriptions and review comments carry signal that numbers miss entirely.
We run Claude Haiku over PR descriptions looking for phrases that empirically correlate with post-merge incidents:
- "quick fix" / "should be fine" / "probably okay"
- "temporarily disabling" / "bypassing validation"
- "will fix later" / "works around"
A PR that scores 45/100 on quantitative signals but has a description saying "disabling rate limiting temporarily while we figure out the auth issue" should be treated very differently. The LLM layer catches that.
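A minimal sketch of that layer using the Anthropic Python SDK; the prompt, the output schema, and the model id are placeholders rather than the production prompt.

```python
import json
import anthropic  # pip install anthropic; expects ANTHROPIC_API_KEY in the environment

RISK_PHRASES_HINT = (
    "quick fix, should be fine, probably okay, temporarily disabling, "
    "bypassing validation, will fix later, works around"
)

def semantic_risk_flags(pr_description: str) -> dict:
    """Ask a small model whether the PR description contains language that
    correlates with post-merge incidents. Prompt and schema are illustrative."""
    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": (
                "You review pull request descriptions for deploy risk. "
                f"Risky patterns include phrases like: {RISK_PHRASES_HINT}. "
                "Return JSON with keys 'flags' (list of risky phrases found) "
                "and 'summary' (one sentence). Description:\n\n" + pr_description
            ),
        }],
    )
    # Assumes the model returns plain JSON; harden this in real use.
    return json.loads(message.content[0].text)

# semantic_risk_flags("Disabling rate limiting temporarily while we figure out the auth issue")
# -> {"flags": ["disabling rate limiting temporarily"], "summary": "..."}
```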
What This Looks Like in Practice
If you want to see the model applied to real code, I ran it against some famous open-source PRs — React Hooks, the TypeScript module migration, Svelte 5's rewrite. Some results are counterintuitive: https://koalr.com/blog/famous-open-source-prs-deploy-risk-scores
The live version of the scorer is at https://app.koalr.com/live-risk-demo — no account required, paste any public GitHub PR URL.
The research is clear: deployment failures are largely predictable before the merge button is clicked. The industry just hasn't built the tooling to act on it yet.