Most release processes look good on the surface. Pipelines are green, deploys happen all the time, and nothing is on fire. The cracks only show when something breaks, and suddenly you’re trying to answer questions your tooling can’t help with.
DORA metrics measure speed and stability. They don't tell you whether rollback actually works, whether approvals live anywhere outside of Slack, or whether the artifact you tested is the one in production. You can hit all four DORA metrics and still be one bad deploy away from a bad day.
Here are seven signs your process is more fragile than you realize. Individually they seem small. Together, they don't hold up well under pressure.
1. "Who deployed this?" takes more than a minute to answer
This question comes up almost as soon as a serious incident starts. If you have to dig through CI logs, scroll through Slack, or hope someone remembers, your release process has no operational memory.
The data is out there somewhere, but it’s not structured as a release record; it’s scattered across pipeline runs, commit messages, and conversations. A CI run doesn’t capture the actual deployment intent, so engineers end up opening multiple systems during incident calls to reconstruct the course of events.
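To make that concrete, here is a minimal sketch in Python of what a structured release record could look like. The field names, file path, and values are illustrative assumptions, not a reference to any particular tool; the point is that one append-only log can answer "who deployed what, where, and when" without reopening CI logs and Slack.

```python
# Minimal sketch of a structured release record (field names and the
# file path are illustrative). One append-only log answers "who deployed
# what, where, and when" in a single lookup.
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class ReleaseRecord:
    service: str        # what was deployed
    version: str        # exact artifact identifier (tag or digest)
    environment: str    # where it went
    deployed_by: str    # who triggered it
    approved_by: str    # who signed off
    pipeline_run: str   # link back to the CI run for context
    timestamp: str

def record_release(record: ReleaseRecord, path: str = "releases.jsonl") -> None:
    """Append the release to a log that outlives any single CI run."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

record_release(ReleaseRecord(
    service="checkout-api",                    # hypothetical service name
    version="checkout-api@sha256:3f1c9a",      # hypothetical digest
    environment="production",
    deployed_by="alice",
    approved_by="bob",
    pipeline_run="https://ci.example.com/runs/4812",  # hypothetical URL
    timestamp=datetime.now(timezone.utc).isoformat(),
))
```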
2. The rollback lives in a doc, not in the system
A tested, well-defined rollback path matters more than almost anything else on this list. Most teams talk about having a rollback plan, but hardly anyone has put it to the test during a real incident.
Here’s what really happens: you scramble to find the old tag, kick off a pipeline, cross your fingers that the environment hasn’t changed, and try to remember all those finicky parameters. Every single step is a chance for things to break. As a post-mortem on approval-gated releases put it directly: “rollback instructions existed in documents, but not in executable policy.”
So you end up staring into that gap right when the pressure is at its worst, and that’s where the chaotic improvisation starts.
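As a sketch of what rollback as policy could look like, assuming the release log format from the earlier example: look up the last known-good production release and redeploy that exact artifact through the same code path as a normal deploy. `deploy_artifact` is a stand-in for whatever actually ships your artifact.

```python
# Sketch: rollback as executable policy rather than a document. Reads
# the release log (format assumed from the earlier sketch) and redeploys
# the previous production artifact. `deploy_artifact` is a placeholder.
import json

def deploy_artifact(service: str, version: str, environment: str) -> None:
    print(f"deploying {service} {version} to {environment}")  # stand-in

def previous_production_release(service: str, path: str = "releases.jsonl") -> dict:
    with open(path) as f:
        releases = [json.loads(line) for line in f]
    prod = [r for r in releases
            if r["service"] == service and r["environment"] == "production"]
    if len(prod) < 2:
        raise RuntimeError("no previous production release to roll back to")
    return prod[-2]  # the release before the one currently live

def rollback(service: str) -> None:
    target = previous_production_release(service)
    # Same path as a normal deploy: no special-case improvisation under pressure.
    deploy_artifact(service, target["version"], environment="production")
```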
3. Deployments come down to whoever’s got access and enough urgency to use it
Until there is an explicit process, production access usually comes down to whoever can deploy and needs to do it quickly. That feels like efficiency on a normal day. On a risky change it gets murky: was this checked? Is this the correct artifact? Who is responsible for the go decision?
If you’ve got repo access with deploy permissions, merging code is often synonymous with deploying it. That’s not really a decision; that’s just what happens when CI is doing the deployments.
From what I can tell, the gap only becomes visible when something has gone wrong and suddenly no one knows exactly who owns the release.
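One way to make the go decision explicit is sketched below, under the assumption that approvals are recorded somewhere queryable (the file format and field names here are made up): the deploy step refuses to ship an artifact to production unless an approval exists for that exact version.

```python
# Sketch of an explicit approval gate: production deploys require a
# recorded approval for the exact artifact being shipped. Storage format
# and field names are illustrative, not from a particular tool.
import json

def is_approved(service: str, version: str, path: str = "approvals.jsonl") -> bool:
    with open(path) as f:
        for line in f:
            a = json.loads(line)
            if (a["service"] == service
                    and a["version"] == version
                    and a["environment"] == "production"):
                return True
    return False

def deploy_to_production(service: str, version: str) -> None:
    if not is_approved(service, version):
        raise PermissionError(
            f"{service} {version} has no recorded production approval")
    # ...hand off to the actual deploy step here
```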
4. Your deployment metrics are good because you’re measuring the wrong thing
Pass/fail is the wrong indicator here. A release that makes it through on the third try after two reruns is not the same as one that went through cleanly the first time.
The rerun ratio is the stat most teams overlook. Successful releases with high rerun rates are symptoms of brittle steps, flaky dependencies, and environment inconsistencies, even when the terminal status is green. Environment inconsistency is one of the most underdiagnosed CI/CD problems precisely because it doesn’t show up in pass/fail rates until a regression hits production.
If your deployment telemetry doesn’t distinguish between first-run success and eventual success, you are missing an important reliability signal.
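Here is a sketch of how that distinction could be computed, assuming each pipeline run record carries an identifier, an attempt number, and a pass/fail flag. The field names are my own assumption about the run data you export from CI, not any tool's schema.

```python
# Sketch: separate first-run success from eventual success, plus a
# rerun ratio. Field names ("pipeline_id", "attempt", "passed") are
# illustrative assumptions about exported CI run data.
def release_reliability(runs: list[dict]) -> dict:
    by_pipeline: dict[str, list[dict]] = {}
    for run in runs:
        by_pipeline.setdefault(run["pipeline_id"], []).append(run)
    if not by_pipeline:
        return {}

    total = len(by_pipeline)
    first_try = eventual = reruns = 0
    for attempts in by_pipeline.values():
        attempts.sort(key=lambda r: r["attempt"])
        first_try += attempts[0]["passed"]            # clean on the first run
        eventual += any(r["passed"] for r in attempts)  # green after retries
        reruns += len(attempts) - 1

    return {
        "first_run_success_rate": first_try / total,
        "eventual_success_rate": eventual / total,
        "reruns_per_release": reruns / total,
    }
```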
5. Staging and production run different artifacts
If you rebuild per environment instead of promoting the same artifact, there’s no guarantee that what passed QA is what ends up in production. Dependencies can resolve differently, configs can drift, and suddenly you’re debugging a bug you’ve never seen in staging.
The fix is simple in theory: build once, promote the same artifact everywhere. But many CI setups rebuild by default, because that’s how the pipeline is wired.
Most of the time, no one notices until a regression makes it obvious.
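A minimal sketch of the "build once, promote everywhere" flow follows. `build_and_push` and `deploy_artifact` are placeholders for real build and deploy tooling; the important part is that the artifact identifier is produced exactly once and reused for every environment.

```python
# Sketch: build once, then promote the same artifact identifier through
# each environment instead of rebuilding per stage. The helpers below
# are stand-ins for real tooling (registry push, helm/kubectl, etc.).
def build_and_push(service: str, commit: str) -> str:
    # A real implementation would return the digest reported by the registry.
    return f"{service}@sha256:<digest-for-{commit}>"

def deploy_artifact(service: str, version: str, environment: str) -> None:
    print(f"deploying {service} {version} to {environment}")  # stand-in

def release(service: str, commit: str, environments: list[str]) -> None:
    digest = build_and_push(service, commit)   # happens exactly once
    for env in environments:                   # e.g. ["staging", "production"]
        # a promotion gate (tests, approval) would sit between environments here
        deploy_artifact(service, digest, environment=env)

release("checkout-api", "9f2b41c", ["staging", "production"])
```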
6. A CI outage takes your releases down with it
If deployments run through your CI, a platform issue can take out both builds and releases at the same time. The moment you most need to ship a fix is the moment you can’t.
And this isn’t rare. Looking at a set of failed or delayed releases, the root cause was often not bad code but a fragile setup: wrong variables in CI, missing approvals, unclear artifact versions.
When your release layer depends entirely on your build layer, their availability becomes the same.
7. Deployment knowledge concentrates in one or two people
What I mean is that the team's ability to ship depends on specific individuals being present and available.
This is the least dramatic sign, yet I’m convinced it’s the most damaging long-term. Teams usually notice it when the "deployment person" takes a holiday and suddenly nobody is confident releasing anything, or when someone leaves and takes the full process knowledge with them.
Deployment governance isn't just an operational concern. It's a scaling and team health concern.
Why do these signs appear together?
None of these is a crisis on its own. That’s why they tend to stick around for so long.
Together, they point to a release process optimized for the easy case: normal days, routine deploys, no incidents. The root cause is the same across all seven. CI and CD are different responsibilities. One is about validating code, the other is about governing how it moves. But most setups try to handle both with a single system designed for only one of them.
Once you dig into the CI vs. CD distinction, the pattern becomes clear quickly. The coupling feels efficient early and becomes fragile at scale. Build maintainers end up carrying production risk decisions they don't own. Deployment logic accumulates inside pipelines built for something else.
As one measurement framework puts it: standardization doesn't remove complexity, it makes complexity measurable. And once it's measurable, you can improve it deliberately instead of reacting to isolated incidents.
How many did you recognize?
At first, you’ll most likely recognize one or two. That’s nothing unusual; every team goes through this before major issues show up. When you start seeing four or five, the cracks are about to show: things still run, but the process is clearly heavier than it should be. Hit six or seven, and you’re in trouble. The next incident is probably going to be costly, not because of the bug itself, but because reconstructing what happened and how to undo it will slow everything to a crawl. The sooner you notice the pattern, the better.
What to do about it
The good news is that you don’t have to dismantle your pipeline to fix any of this.
The main change here is splitting up the purpose of CI and the purpose of releasing. CI’s whole job is to build, test, and create an artifact. And it does that just fine. Everything after that (what version goes where, who approved it, and what happens if it fails) is a whole other responsibility that needs its own structure.
So, when you put this into practice, you end up with four key things:
- build the artifact once, then push it to each environment instead of rebuilding at every stage
- make approvals a real, tracked step, not just a message in Slack
- test your rollback process before you face an actual incident
- keep your release history somewhere safe, so it doesn’t disappear when your CI does
You don't actually need a large platform. A thin layer between build and release, one that owns the release decision, captures the audit trail, and handles rollback as policy rather than improvisation, removes most of the ambiguity without much overhead.
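Pulled together, that thin layer can be as small as one function: gate the deploy on a recorded approval, ship the already-built artifact, and write the release record that rollback later relies on. The sketch below is self-contained but reuses the same illustrative assumptions (field names, log file, stand-in deploy helper) as the earlier examples.

```python
# Sketch of the thin release layer between build and deploy: it owns the
# go decision, the audit trail, and the rollback target. Helper names and
# the log format are illustrative, not from a particular tool.
import json
from datetime import datetime, timezone

RELEASE_LOG = "releases.jsonl"   # assumed append-only log, kept outside CI

def deploy_artifact(service: str, version: str, environment: str) -> None:
    print(f"deploying {service} {version} to {environment}")  # stand-in

def promote(service: str, version: str, environment: str,
            deployed_by: str, approved_by: str | None) -> None:
    if environment == "production" and approved_by is None:
        raise PermissionError(f"{service} {version} needs a recorded approval")
    deploy_artifact(service, version, environment)     # ship the built artifact
    with open(RELEASE_LOG, "a") as f:                  # audit trail survives CI
        f.write(json.dumps({
            "service": service, "version": version, "environment": environment,
            "deployed_by": deployed_by, "approved_by": approved_by,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }) + "\n")
```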
I’ve seen this pattern enough times to know the structure problem only reveals itself under pressure. Everything is fine until it isn't. And when something breaks at a bad time, the question is not whether you can fix the bug, but whether you can understand what happened quickly enough to do something about it.