My Nginx Died at 2 AM and Nobody Noticed for 6 Hours. Now I Have a Watchdog Script

Nginx crashed on a Saturday night. An OOM kill, probably — I was running a Node app that leaked memory like a broken faucet. The service went down at 2:14 AM. I found out at 8:30 AM when I opened my laptop and saw Slack messages from six hours earlier asking why the site was down.

The fix took 10 seconds: sudo systemctl start nginx. The downtime cost me a weekend of credibility.

The thing is, systemctl already knows when a service dies. I just wasn't asking it to check. So I wrote a script that asks every 60 seconds and restarts the service if it's down. Took less time to write than it did to explain the outage to my team.


The Script

#!/bin/bash

CHECK="✓"
CROSS="✗"

# --- Configuration ---
SERVICE="nginx"                              # Change to your service name
LOG_FILE="/var/log/service-watchdog.log"
DATE=$(date '+%Y-%m-%d %H:%M:%S')
NOTIFY_EMAIL=""                              # Optional: address for alerts; leave empty to disable

# --- Check if service is running ---
if systemctl is-active --quiet "$SERVICE"; then
  echo "$CHECK [$DATE] $SERVICE is running"
else
  echo "$CROSS [$DATE] $SERVICE is NOT running — attempting restart..."
  echo "$CROSS [$DATE] $SERVICE DOWN — restarting" >> "$LOG_FILE"

  # --- Attempt restart ---
  if sudo systemctl start "$SERVICE"; then
    echo "$CHECK [$DATE] $SERVICE restarted successfully" | tee -a "$LOG_FILE"

    # --- Optional: send email notification ---
    if [ -n "$NOTIFY_EMAIL" ]; then
      echo "$SERVICE was down and has been restarted on $(hostname) at $DATE" \
        | mail -s "[RECOVERED] $SERVICE restarted" "$NOTIFY_EMAIL"
    fi
  else
    echo "$CROSS [$DATE] $SERVICE FAILED to restart — manual intervention needed" \
      | tee -a "$LOG_FILE"

    if [ -n "$NOTIFY_EMAIL" ]; then
      echo "$SERVICE failed to restart on $(hostname) at $DATE. Check: journalctl -u $SERVICE" \
        | mail -s "[CRITICAL] $SERVICE restart failed" "$NOTIFY_EMAIL"
    fi
  fi
fi

Why systemctl is-active --quiet and Not Something Else

I've seen people use ps aux | grep nginx for this. Don't. Here's why:

ps aux | grep nginx has a classic gotcha — the grep command itself shows up in the results because the word "nginx" is in the grep command line. People "fix" this with grep -v grep which works but is fragile and ugly. You're parsing process tables to answer a question that systemd already tracks natively.

systemctl is-active --quiet "$SERVICE" asks systemd directly: "is this unit in the active state?" The --quiet flag suppresses output and just returns an exit code. 0 means active. Anything else means it's not running. Clean, reliable, no string parsing.
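
To see the exit-code behavior on its own, here's a minimal sketch. The helper name service_state is mine, not part of the watchdog script; all it does is wrap the same systemctl is-active --quiet call and turn the exit code into a word.

```shell
#!/bin/bash
# service_state is a hypothetical helper name used for illustration.
# It prints "up" when systemd reports the unit as active (exit code 0)
# and "down" for any non-zero exit code.
service_state() {
  if systemctl is-active --quiet "$1"; then
    echo "up"
  else
    echo "down"
  fi
}

# Default to nginx when no unit name is passed on the command line.
service_state "${1:-nginx}"
```

No grep, no process table, and the same function works for any unit name you pass it.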


The Two-Level Failure Check

This isn't just "is it down → restart it." There are two separate failure modes:

Level 1: Is the service running? If yes, print the check mark and exit. Nothing is written to LOG_FILE on a healthy check, so that file only records outages.

Level 2: If the service is down and we try to restart it — did the restart actually work? systemctl start can fail for a dozen reasons: masked unit, broken config file, dependency that's also down, port already in use by something else. The script checks the exit code of the start command and sends a different email depending on whether recovery succeeded or failed.

The [RECOVERED] email means the script fixed it and you can keep sleeping. The [CRITICAL] email means something is actually broken and you need to look at it. That distinction matters at 3 AM.
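
If you'd rather have other tooling react to that distinction, one option is to expose it as an exit code. This is a sketch under my own assumptions, not the script above: the function name check_and_restart and the 0/1/2 codes are illustrative choices, and it assumes it runs as root (otherwise prefix the start call with sudo, as the main script does).

```shell
#!/bin/bash
# check_and_restart and its return codes are illustrative choices,
# not part of the original script:
#   0 = service was already running
#   1 = service was down and the start succeeded   -> [RECOVERED]
#   2 = service was down and the start failed      -> [CRITICAL]
check_and_restart() {
  service="$1"
  # Level 1: is the unit active right now?
  if systemctl is-active --quiet "$service"; then
    return 0
  fi
  # Level 2: it is down, so try to start it and check whether that worked.
  if systemctl start "$service"; then
    return 1
  fi
  return 2
}

# Only act when a unit name is given, e.g.: ./check.sh nginx
if [ "$#" -gt 0 ]; then
  rc=0
  check_and_restart "$1" || rc=$?
  case "$rc" in
    0) echo "OK" ;;
    1) echo "RECOVERED" ;;
    2) echo "CRITICAL" ;;
  esac
fi
```

The two if statements map one-to-one onto the two levels: the first answers "is it up?", the second answers "did the fix work?".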


Setting It Up with Cron

crontab -e

Add this line:

* * * * * /home/user/service-watchdog.sh >> /var/log/watchdog-cron.log 2>&1

That runs every single minute. Is that overkill? Maybe. But the script finishes in under 100ms when the service is healthy, and the alternative is 6 hours of downtime on a Saturday night. I'll take the overkill.

One gotcha with cron and sudo: cron runs with a minimal environment and no terminal. If sudo systemctl start needs a password, sudo has nowhere to prompt, so the command fails and the service stays down. You need a sudoers rule:

sudo visudo
# Add this line:
youruser ALL=(ALL) NOPASSWD: /bin/systemctl start nginx

Or just run the watchdog cron as root.
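
Two small checks can help before you trust the cron job. This is a sketch, and the helper name start_noninteractive is hypothetical. command -v systemctl prints the path on your machine (on many distributions it's /usr/bin/systemctl, so it's safest to put whatever your system reports into the sudoers rule), and sudo's -n flag tells sudo never to prompt, so a missing rule shows up as an immediate error.

```shell
#!/bin/bash
# start_noninteractive is a hypothetical helper name for this sketch.
# sudo -n never prompts: if a password would be required, sudo exits
# with an error instead, so the caller sees a failure it can log.
start_noninteractive() {
  sudo -n systemctl start "$1"
}

# Print the systemctl path found on this machine, for use in the sudoers rule.
command -v systemctl || echo "systemctl not found in PATH" >&2

# Only attempt a start when a unit name is given, e.g.: ./try-start.sh nginx
if [ "$#" -gt 0 ]; then
  if start_noninteractive "$1"; then
    echo "started $1"
  else
    echo "could not start $1: sudo needs a password or the start itself failed" >&2
  fi
fi
```

Run it once by hand as the cron user. If it reports a failure, fix the sudoers rule before relying on the watchdog.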


What Else I Watch With This

The SERVICE variable takes any systemd unit name. I run separate copies for:

  • nginx — the web server
  • mysql or mariadb — the database
  • docker — the container daemon
  • Custom services: my-node-app.service, redis-server, postgresql

If you want to watch multiple services in one script, loop through them:

SERVICES=("nginx" "mysql" "redis-server")
for SERVICE in "${SERVICES[@]}"; do
  # ... same check logic ...
done

But I prefer separate scripts per service because the logs stay clean and each one can have a different notification strategy.
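
If you do go the single-script route, here's roughly what the filled-in loop could look like. It's a sketch, not the full script: the function name watch_services is hypothetical, it takes the unit names as arguments instead of a hard-coded array, it skips the logging and email parts, and it assumes it runs as root.

```shell
#!/bin/bash
# watch_services is a hypothetical name. It checks every unit passed as
# an argument, tries to start the ones that are down, and prints one line
# per unit that needed attention. It returns non-zero if any start failed.
watch_services() {
  failed=0
  for service in "$@"; do
    # Healthy units are skipped silently.
    if systemctl is-active --quiet "$service"; then
      continue
    fi
    if systemctl start "$service"; then
      echo "RECOVERED $service"
    else
      echo "CRITICAL $service"
      failed=1
    fi
  done
  return "$failed"
}

# Example: ./watch-all.sh nginx mysql redis-server
if [ "$#" -gt 0 ]; then
  watch_services "$@"
fi
```

One failed start doesn't stop the loop, so the remaining services still get checked on the same run.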


Pairing This With Other Scripts

This watchdog handles the restart. But if you also want to know why the service died, pair it with:

Between these three scripts, you've got a basic monitoring stack that runs entirely on cron and costs nothing.


Full script, the line-by-line breakdown, cron setup walkthrough, and three more variations:

bashsnippets.xyz/snippets/restart-service-if-stopped.html

If you're managing any Linux server with services that need to stay up, this takes 5 minutes to deploy and runs quietly forever.

Part 1 of 7 in Bash Snippets Pages
