# My Backup Hadn't Run in 9 Days and Nothing Told Me

Level: Intermediate · Time: 5 min · Outcome: A cron command that can't hang forever

The backup ran fine every night for fourteen months, and then it didn't run for nine days, and nothing told me. No error in the log. No failed-job alert. No bounced cron mail. The nightly mysqldump had hung: a runaway analytics query was holding a lock, and the dump opened its transaction and sat there waiting for that lock, forever.

Cron launched it at 2am and it never exited. Here's the cruel part: because I'd been smart enough to wrap it in a lock so two dumps couldn't run at once, every subsequent night's run saw the lock still held by that first hung process and skipped quietly. The clever lock turned a one-night hang into a nine-day outage. I found it when I went to restore a table and discovered my newest "backup" was a mysqldump process that had been running since the previous Tuesday.

That's nine days I'd have lost if anything had gone wrong with the live database. The ten minutes of feeling foolish when I traced it back was nothing next to that.

## A hung command is worse than a failed one

This is the lesson worth internalizing, because it's counterintuitive. A failed command exits, frees its lock, and the next run tries again — the system self-heals. A hung command exits never. It holds resources, blocks its own future runs, and produces exactly zero signal because it never gets far enough to log anything. Failures are loud. Hangs are silent, and silence is what kills you in unattended automation.
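The "blocks its own future runs" part is easy to reproduce. Here's a minimal sketch, assuming the lock is taken with flock -n from util-linux. The temp lock file and the `sleep 3` standing in for the hung dump are illustrative, not my original job.

```bash
#!/bin/bash
# Sketch: a lock holder that doesn't exit makes every later non-blocking run skip.
# LOCK is a throwaway temp file; `sleep 3` stands in for the hung dump.
LOCK="$(mktemp)"

flock -n "$LOCK" sleep 3 &      # "night 1": takes the lock and doesn't return promptly
holder=$!
sleep 0.5                       # give the holder time to acquire the lock

skip_status=0
flock -n "$LOCK" echo "night 2 ran" || skip_status=$?
echo "night 2 exit status: $skip_status"   # flock -n exits 1 when the lock is busy

wait "$holder"                  # once the holder finally exits, the lock is free again
retry_status=0
flock -n "$LOCK" echo "night 3 ran" || retry_status=$?
rm -f "$LOCK"
```

Night 2 prints nothing of its own. Unless the wrapper checks that exit status and reports it, cron just sees a job that started and ended.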

You can't rely on a command to bound its own runtime, either. The whole problem is that it's wedged somewhere it can't time itself out of: blocked in the kernel waiting on a lock, or reading from a dead connection whose peer will never send the FIN it's waiting for. So you bound it from the outside.

## Bounding the runtime with timeout

```bash
#!/bin/bash
# Script: bounded-dump.sh
# Purpose: Stop a hung command from running forever and jamming the cron slot
set -euo pipefail

CHECK="✓"
CROSS="✗"

MAX_RUNTIME="5m"   # longer than the normal worst case, well under the interval
KILL_GRACE="20s"   # after SIGTERM, wait this long, then SIGKILL
DEST="/backup/mydb.sql"

# In an `if` so set -e doesn't abort before we read the exit code.
# Write to a .partial file so a timed-out run never leaves a corrupt "backup".
if timeout -k "$KILL_GRACE" "$MAX_RUNTIME" \
        mysqldump --single-transaction mydb > "$DEST.partial"; then
    mv "$DEST.partial" "$DEST"
    echo "$CHECK dump completed within $MAX_RUNTIME"
else
    code=$?
    rm -f "$DEST.partial"
    case "$code" in
        124) echo "$CROSS dump exceeded $MAX_RUNTIME — terminated (SIGTERM)" >&2 ;;
        137) echo "$CROSS dump ignored SIGTERM — force-killed (SIGKILL)" >&2 ;;
        *)   echo "$CROSS dump failed with exit code $code" >&2 ;;
    esac
    exit "$code"
fi
```

timeout -k "$KILL_GRACE" "$MAX_RUNTIME" mysqldump ... is the entire mechanism. At five minutes, timeout sends the dump a SIGTERM. A well-behaved program treats SIGTERM as "wrap up and exit." But SIGTERM is only a request. The dump from my outage wasn't misbehaving; it was stuck waiting on a lock, and a stuck process can't be counted on to honor a request. That's what -k 20s is for: twenty seconds after the polite signal, timeout sends SIGKILL, which no process can catch, block, or ignore. The one caveat is a process in uninterruptible sleep (state D in ps, typically stuck on a hung NFS mount or a failing disk): it is not torn down until the kernel call returns, so it can outlive even SIGKILL for a while.
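The escalation is easy to watch with a stand-in process that ignores SIGTERM. This is a sketch assuming GNU coreutils timeout. The one-second limits and the bash -c stand-in are just for demonstration.

```bash
#!/bin/bash
# Sketch: SIGTERM is enough for a cooperative process; SIGKILL is needed for one that ignores it.

# Cooperative: sleep dies on SIGTERM at the 1s deadline.
polite=0
timeout -k 1s 1s sleep 30 || polite=$?

# Stubborn: the shell ignores SIGTERM (an ignored signal is inherited by its child sleep),
# so timeout waits out the 1s grace period and then sends SIGKILL.
stubborn=0
timeout -k 1s 1s bash -c 'trap "" TERM; sleep 30' || stubborn=$?

echo "polite=$polite stubborn=$stubborn"
```

With GNU timeout this should print polite=124 stubborn=137, the same two codes the dump script branches on. Bash may also print a "Killed" notice on stderr for the second run.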

## Read the exit code: it's the difference between a log and a mystery

```text
124  # still running at the deadline — SIGTERM was sent
137  # 128 + 9 — ignored SIGTERM, had to be force-killed
```

Collapsing every outcome into "backup failed" throws away the one piece of information that tells you whether you have a slow database, a wedged one, or a broken dump command. A 124 says "this ran past the deadline and stopped when asked: investigate the slow query." A 137 says "this had to be force-killed: investigate the lock or the mount it was wedged on" (or, less often, that something else such as the kernel's OOM killer sent the SIGKILL). Anything else is the dump's own exit code. They point at different problems. (If you ever blank on which code is which, the Bash Exit Code Lookup decodes 124 and 137 directly.)
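timeout has a few more reserved codes than the two above. Here's a small sketch of a lookup helper. describe_timeout_status is a name I made up, and the code meanings follow the GNU coreutils timeout documentation.

```bash
#!/bin/bash
# Sketch: turn the status from `timeout ... cmd` into a log-friendly reason.
# Codes 124-127 and 137 follow the GNU coreutils timeout documentation.
describe_timeout_status() {
    case "$1" in
        0)   echo "ok" ;;
        124) echo "timed out (stopped after SIGTERM)" ;;
        125) echo "timeout itself failed" ;;
        126) echo "command found but not executable" ;;
        127) echo "command not found" ;;
        137) echo "killed by SIGKILL" ;;
        *)   echo "command exited with $1" ;;
    esac
}

status=0
timeout 5s bash -c 'exit 3' || status=$?
describe_timeout_status "$status"    # the command's own code passes through untouched

status=0
timeout 5s /nonexistent/dump-tool || status=$?   # hypothetical path that does not exist
describe_timeout_status "$status"
```

The second call is the one that catches a typo'd binary path in a crontab, which otherwise looks like just another failed backup.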

## The .partial trick matters more than it looks

If you redirect straight to the real backup file and the command times out mid-write, you've just replaced last night's good backup with a half-written, unrestorable file — and you won't know until the night you need it. Writing to a .partial path and mv-ing into place only on a clean exit means a failed or timed-out run leaves the previous good backup untouched. A mv on the same filesystem is atomic; the redirect is not.
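The same pattern pulled out into a reusable function, as a sketch. atomic_write is a hypothetical helper name, and the temp directory and echo commands stand in for a real dump.

```bash
#!/bin/bash
# Sketch: only replace the destination when the producer command succeeds.
# atomic_write is a hypothetical helper; DEST is a throwaway temp path.
atomic_write() {
    local dest="$1"; shift
    if "$@" > "$dest.partial"; then
        mv "$dest.partial" "$dest"   # rename in the same directory: readers see old or new, never half
    else
        local code=$?
        rm -f "$dest.partial"
        return "$code"
    fi
}

DEST="$(mktemp -d)/mydb.sql"
atomic_write "$DEST" echo "good backup"                     # first run succeeds
atomic_write "$DEST" bash -c 'echo "half a dum"; exit 1' \
    || echo "second run failed with $?"                     # failed run only ever touched .partial
cat "$DEST"
```

The final cat still shows the first run's content, because the failed run never got as far as the mv.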

For commands that talk to the network, layer the tool's own timeout underneath: curl --max-time, ssh -o ConnectTimeout, a MySQL setting such as net_read_timeout for the dump. Those fire first and fail cleanly. timeout is the outer hard stop for the night the inner one doesn't fire.
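A sketch of what that layering can look like for curl and ssh. fetch_bounded and remote_bounded are hypothetical helper names, and the limits, host, and URL in the usage comments are placeholders.

```bash
#!/bin/bash
# Sketch: inner, tool-level timeouts fire first; timeout is the outer hard stop.
# fetch_bounded and remote_bounded are hypothetical helpers; the limits are placeholders.

fetch_bounded() {
    # curl gives up on its own after 5s connecting or 20s total;
    # timeout ends it at 30s if curl itself wedges.
    timeout -k 5s 30s curl --fail --silent --show-error \
        --connect-timeout 5 --max-time 20 "$@"
}

remote_bounded() {
    # ssh's ConnectTimeout bounds only the connection attempt;
    # timeout bounds the whole remote command.
    timeout -k 5s 60s ssh -o ConnectTimeout=10 -o BatchMode=yes "$@"
}

# Usage (not run here; host and URL are made up):
#   fetch_bounded -o /tmp/status.json https://example.com/status
#   remote_bounded backup-host 'df -h /backup'
```

Keeping the inner limit shorter than the outer one means the normal failure is the tool's own readable error message, and a 124 or 137 only shows up when something is truly stuck.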

## Back to the nine days

A timeout is what makes a lock safe. Locking a job to a single instance stops overlap, but a hang inside the locked job holds that lock forever — which is precisely how my nine-day gap happened. Bound the runtime and the lock always gets released, on a deadline, every time, and the failure becomes a loud 124 in the log instead of a silent gap you discover during a restore.
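Put together, it looks roughly like this. It's a sketch assuming flock from util-linux and GNU timeout. run_guarded is a hypothetical helper, and the one-second limits and temp lock file are only there so the demo finishes quickly.

```bash
#!/bin/bash
# Sketch: single-instance lock plus a hard deadline, so a hang can't hold the lock forever.
# run_guarded is a hypothetical helper; LOCK is a throwaway temp file.
LOCK="$(mktemp)"

run_guarded() {
    # flock -n: skip (exit 1) if another run holds the lock.
    # timeout: end the job at the deadline, which lets flock exit and release the lock.
    flock -n "$LOCK" timeout -k 1s 1s "$@"
}

hung=0
run_guarded sleep 30 || hung=$?        # stand-in for the hung dump
echo "hung run exit status: $hung"

next=0
run_guarded echo "next run got the lock" || next=$?
rm -f "$LOCK"
```

The order matters: flock wraps timeout, so the deadline applies to the job while the lock is held, and the lock goes away as soon as timeout returns.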

Full script with the exit-code branching and the FAQ on timing out a whole pipeline: https://bashsnippets.xyz/snippets/bash-timeout-command
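On the pipeline point: timeout runs a single command, so `timeout 5m a | b` bounds only a. One way to bound the whole thing (a sketch, not necessarily the approach on the linked page) is to hand the pipeline to a single bash -c. The `sleep 30 | cat` is a stand-in for something like a dump piped into gzip.

```bash
#!/bin/bash
# Sketch: wrap the pipeline in one bash -c so timeout's deadline covers every stage.
# `sleep 30 | cat` is a stand-in for a real pipeline; the 1s limits are for demonstration.
status=0
timeout -k 1s 1s bash -c 'set -o pipefail; sleep 30 | cat' || status=$?
echo "pipeline exit status: $status"
```

pipefail inside the wrapper keeps a failure in an early stage from being masked by a later stage that exits 0.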

The lock that made my hang invisible is flock, and the third guard is retry with backoff. The Hardened Cron Wrapper Generator composes all three, Bash Scripts That Survive Cron walks the whole decision, and the rest of the library is at https://bashsnippets.xyz

Originally published at bashsnippets.xyz

Part 14 of 14 in Bash Snippets Pages
