My Nginx Died at 2 AM and Nobody Noticed for 6 Hours. Now I Have a Watchdog Script

Nginx crashed on a Saturday night. An OOM kill, probably — I was running a Node app that leaked memory like a broken faucet. The service went down at 2:14 AM. I found out at 8:30 AM when I opened my laptop and saw Slack messages from six hours earlier asking why the site was down.

The fix took 10 seconds: `sudo systemctl start nginx`. The downtime cost me a weekend of credibility.

The thing is, systemctl already knows when a service dies. I just wasn't asking it to check. So I wrote a script that asks every 60 seconds and restarts the service if it's down. Took less time to write than it did to explain the outage to my team.


## The Script

```bash
#!/bin/bash

CHECK="✓"
CROSS="✗"

# --- Configuration ---
SERVICE="nginx"                              # Change to your service name
LOG_FILE="/var/log/service-watchdog.log"
DATE=$(date '+%Y-%m-%d %H:%M:%S')
NOTIFY_EMAIL=""                              # Optional: address for alerts; leave empty to disable

# --- Check if service is running ---
if systemctl is-active --quiet "$SERVICE"; then
  echo "$CHECK [$DATE] $SERVICE is running"
else
  echo "$CROSS [$DATE] $SERVICE is NOT running, attempting restart..."
  echo "$CROSS [$DATE] $SERVICE DOWN, restarting" >> "$LOG_FILE"

  # --- Attempt restart ---
  if sudo systemctl start "$SERVICE"; then
    echo "$CHECK [$DATE] $SERVICE restarted successfully" | tee -a "$LOG_FILE"

    # --- Optional: send email notification ---
    if [ -n "$NOTIFY_EMAIL" ]; then
      echo "$SERVICE was down and has been restarted on $(hostname) at $DATE" \
        | mail -s "[RECOVERED] $SERVICE restarted" "$NOTIFY_EMAIL"
    fi
  else
    echo "$CROSS [$DATE] $SERVICE FAILED to restart, manual intervention needed" \
      | tee -a "$LOG_FILE"

    if [ -n "$NOTIFY_EMAIL" ]; then
      echo "$SERVICE failed to restart on $(hostname) at $DATE. Check: journalctl -u $SERVICE" \
        | mail -s "[CRITICAL] $SERVICE restart failed" "$NOTIFY_EMAIL"
    fi
  fi
fi
```

## Why `systemctl is-active --quiet` and Not Something Else

I've seen people use `ps aux | grep nginx` for this. Don't. Here's why:

`ps aux | grep nginx` has a classic gotcha: the grep command itself shows up in the results because the word "nginx" is in the grep command line. People "fix" this with `grep -v grep`, which works but is fragile and ugly. You're parsing process tables to answer a question that systemd already tracks natively.
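
Both problems are easy to reproduce without touching a live process table. The sketch below runs the naive check and the patched one against hand-written `ps aux` lines. The PIDs and columns are hypothetical sample text, not captured output, and the function names `naive_check` and `patched_check` are made up for illustration.

```bash
# Hypothetical `ps aux` lines, stored as plain text so the example is repeatable.
# In both snapshots nginx itself is NOT running.
only_grep='user  4242  0.0  0.0   6480  2304 pts/0  S+  02:14  0:00 grep nginx'
editor_open='user  4250  0.1  0.2  21000  9000 pts/1  S+  02:15  0:00 vim /etc/nginx/nginx.conf'

# Naive check: report "running" if any line mentions nginx.
naive_check() {
  printf '%s\n' "$1" | grep -q nginx
}

# The usual patch: drop lines that mention grep, then look for nginx.
patched_check() {
  printf '%s\n' "$1" | grep -v grep | grep -q nginx
}

naive_check "$only_grep"     && echo "naive: nginx looks up (it is not)"
patched_check "$only_grep"   || echo "patched: correctly sees nothing"
patched_check "$editor_open" && echo "patched: fooled by an editor session"
```

The patch handles the self-match, but any command line that contains the string, such as an editor with the config file open, still counts as "nginx is running."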

`systemctl is-active --quiet "$SERVICE"` asks systemd directly: "is this unit in the active state?" The `--quiet` flag suppresses output and just returns an exit code. `0` means active. Any non-zero code means the unit isn't active. Clean, reliable, no string parsing.

---

## The Two-Level Failure Check

This isn't just "is it down → restart it." There are two separate failure modes:

**Level 1:** Is the service running? If yes, print the check mark and exit. Nothing is written to `LOG_FILE`, so that log only records outages.

**Level 2:** If the service is down and we try to restart it, did the restart actually work? `systemctl start` can fail for a dozen reasons: masked unit, broken config file, dependency that's also down, port already in use by something else. The script checks the exit code of the start command and sends a different email depending on whether recovery succeeded or failed.

The `[RECOVERED]` email means the script fixed it and you can keep sleeping. The `[CRITICAL]` email means something is actually broken and you need to look at it. That distinction matters at 3 AM.

---

## Setting It Up with Cron

```bash
crontab -e
```

Add this line:

```bash
* * * * * /home/user/service-watchdog.sh >> /var/log/watchdog-cron.log 2>&1
```

That runs every single minute. Is that overkill? Maybe. But the script finishes in under 100ms when the service is healthy, and the alternative is 6 hours of downtime on a Saturday night. I'll take the overkill.

**One gotcha with cron and sudo:** cron runs with a minimal environment and no terminal. If `sudo systemctl start` needs a password, sudo has no terminal to prompt on, so it errors out and the restart never happens. You need a sudoers rule:

```bash
sudo visudo
# Add this line:
youruser ALL=(ALL) NOPASSWD: /bin/systemctl start nginx
```

sudoers wants the full path to the command. If `command -v systemctl` prints a different path on your system, such as `/usr/bin/systemctl`, use that one.

Or just run the watchdog cron as root.

---

## What Else I Watch With This

The `SERVICE` variable takes any systemd unit name. I run separate copies for:

- `nginx`: the web server
- `mysql` or `mariadb`: the database
- `docker`: the container daemon
- Custom services: `my-node-app.service`, `redis-server`, `postgresql`

If you want to watch multiple services in one script, loop through them:

```bash
SERVICES=("nginx" "mysql" "redis-server")

for SERVICE in "${SERVICES[@]}"; do
  # ... same check logic ...
done
```


But I prefer separate scripts per service because the logs stay clean and each one can have a different notification strategy.
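
For anyone who does want the single-script version, the "same check logic" placeholder has to become something reusable. Here is a minimal sketch of one way to do that, not the version from the full write-up. The function names `check_service` and `check_all` and the 0/1/2 return codes are my own assumptions, and the email step is left out to keep it short.

```bash
# Sketch: one watchdog pass over several services.
# LOG_FILE falls back to the same path the single-service script uses.
LOG_FILE="${LOG_FILE:-/var/log/service-watchdog.log}"

# Check one unit and try to recover it.
# Returns 0 if it was already running, 1 if it was down and restarted,
# 2 if the restart failed. (These codes are an assumption of this sketch.)
check_service() {
  local service="$1"
  local now
  now=$(date '+%Y-%m-%d %H:%M:%S')

  # Level 1: already active, nothing to log.
  if systemctl is-active --quiet "$service"; then
    echo "✓ [$now] $service is running"
    return 0
  fi

  # Level 2: the unit is down, so try to start it and check the result.
  echo "✗ [$now] $service DOWN, restarting" >> "$LOG_FILE"
  if sudo systemctl start "$service"; then
    echo "✓ [$now] $service restarted successfully" | tee -a "$LOG_FILE"
    return 1
  fi

  echo "✗ [$now] $service FAILED to restart" | tee -a "$LOG_FILE"
  return 2
}

# Check every unit passed as an argument. A failure on one service does
# not stop the loop; the worst code seen becomes the overall return value.
check_all() {
  local worst=0
  local rc
  local service
  for service in "$@"; do
    rc=0
    check_service "$service" || rc=$?
    if [ "$rc" -gt "$worst" ]; then
      worst=$rc
    fi
  done
  return "$worst"
}
```

Call it as `check_all nginx mysql redis-server`, or as `check_all "${SERVICES[@]}"` with the array from the loop. Because one broken unit doesn't abort the pass, the remaining services still get checked. The cost is the one mentioned above: every service now shares a single log and a single notification path.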

---

## Pairing This With Other Scripts

This watchdog handles the restart. But if you also want to know *why* the service died, pair it with:

- [Monitor CPU & RAM Usage](https://bashsnippets.xyz/snippets/monitor-cpu-ram-usage.html) — catches the OOM conditions that kill services in the first place
- [Send Email Alert from Bash](https://bashsnippets.xyz/snippets/bash-send-email-alert.html) — the email sending setup if you've never configured `mail` on Linux

Between these three scripts, you've got a basic monitoring stack that runs entirely on cron and costs nothing.

---

Full script, the line-by-line breakdown, cron setup walkthrough, and three more variations:

**[bashsnippets.xyz/snippets/restart-service-if-stopped.html](https://bashsnippets.xyz/snippets/restart-service-if-stopped.html)**

If you're managing any Linux server with services that need to stay up, this takes 5 minutes to deploy and runs quietly forever.
Part 1 of 15 in Bash Snippets Pages
