One simple watchdog solution is to use a slightly modified apachectl script, which we have called apache.watchdog. Call it from cron every 30 minutes (or even every minute) so that a server that has gone down is noticed and restarted promptly.

The crontab entry for 30-minute intervals would read:

5,35 * * * * /path/to/the/apache.watchdog >/dev/null 2>&1

The script is shown in Example 5-8.

Example 5-8. apache.watchdog

#!/bin/sh

# This script is a watchdog checking whether
# the server is online.
# If the server is down, it tries to start it
# and sends an email alert to the admin.

# admin's email
EMAIL=webmaster@example.com

# the path to the PID file
PIDFILE=/home/httpd/httpd_perl/logs/httpd.pid

# the path to the httpd binary, including any options if necessary
HTTPD=/home/httpd/httpd_perl/bin/httpd_perl

# check for pidfile
if [ -f "$PIDFILE" ] ; then
    PID=`cat "$PIDFILE"`

    # kill -0 sends no signal; it only tests whether the process exists
    if kill -0 "$PID" 2>/dev/null; then
        STATUS="httpd (pid $PID) running"
        RUNNING=1
    else
        STATUS="httpd (pid $PID?) not running"
        RUNNING=0
    fi
else
    STATUS="httpd (no pid file) not running"
    RUNNING=0
fi

if [ $RUNNING -eq 0 ]; then
    echo "$0: httpd not running, trying to start"
    if $HTTPD ; then
        echo "$0: httpd started"
        # options come before the recipient address
        mail -s "$0: httpd started" $EMAIL \
             < /dev/null > /dev/null 2>&1
    else
        echo "$0: httpd could not be started"
        mail -s "$0: httpd could not be started" $EMAIL \
             < /dev/null > /dev/null 2>&1
    fi
fi
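
The liveness test at the heart of this script can be isolated into a small function, which makes it easier to try out by hand. The following is only a sketch: is_running is a hypothetical helper name, not something apachectl provides. It also inherits the limits of the technique: kill -0 fails for a process owned by another user unless the caller is root, and a stale PID file may name a PID that has since been reused by an unrelated process.

```shell
#!/bin/sh

# Sketch: report whether the process recorded in a PID file is alive.
# is_running is a hypothetical helper, shown for illustration only.
is_running() {
    pidfile=$1

    # A missing or empty PID file means there is nothing to check
    [ -s "$pidfile" ] || return 1

    pid=`cat "$pidfile"`

    # kill -0 delivers no signal; it succeeds only if the process
    # exists and we are allowed to signal it
    kill -0 "$pid" 2>/dev/null
}
```

It could be used as follows, with the PID file path taken from the example above:

    is_running /home/httpd/httpd_perl/logs/httpd.pid && echo "httpd running"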

Another approach is to use the Perl LWP module to test the server by trying to fetch a URI served by the server. This is more practical because although the server may be running as a process, it may be stuck and not actually serving any requests—for example, when there is a stale lock that all the processes are waiting to acquire. Failing to get the document will trigger a restart, and the problem will probably go away.
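
Before turning to the LWP version, the decision logic just described (fetch a document, and restart the server if the fetch fails) can be sketched in shell. This is an illustration under stated assumptions, not the script used in this chapter: it assumes curl is installed, check_and_restart is a hypothetical function name, and the 20-second limit is an arbitrary choice. The time limit matters because a stuck server may accept the connection and never answer, and without a limit the watchdog itself would hang.

```shell
#!/bin/sh

# Assumed values for illustration; adjust to the local setup
TEST_URL=${TEST_URL:-http://www.example.com:81/perl/test.pl}
RESTART_CMD=${RESTART_CMD:-/home/httpd/httpd_perl/bin/apachectl restart}

# --fail makes curl exit non-zero on HTTP errors (4xx/5xx);
# --max-time bounds the whole request so a stuck server cannot
# hang the watchdog
FETCH_CMD=${FETCH_CMD:-curl --fail --silent --output /dev/null --max-time 20}

# Hypothetical helper: prints what happened and returns non-zero
# only when the server is down and could not be restarted
check_and_restart() {
    if $FETCH_CMD "$TEST_URL"; then
        echo "alive"
    elif $RESTART_CMD; then
        echo "restarted"
    else
        echo "restart failed"
        return 1
    fi
}
```

A cron-driven wrapper would call check_and_restart and send mail whenever the result is anything other than "alive".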

We set a cron job to call this LWP script every few minutes to fetch a document generated by a very light script. The best thing, of course, is to call it every minute (the finest resolution cron provides). Why so often? If the server gets confused and starts to fill the disk with lots of error messages written to the error_log, the system could run out of free disk space in just a few minutes, which in turn might bring the whole system to its knees. In these circumstances, it is unlikely that any other child will be able to serve requests, since the system will be too busy writing to the error_log file. Think big—if running a heavy service, adding one more request every minute will have no appreciable impact on the server's load.

So we end up with a crontab entry like this:

* * * * * /path/to/the/watchdog.pl > /dev/null

The watchdog itself is shown in Example 5-9.

Example 5-9. watchdog.pl

#!/usr/bin/perl -Tw

# These prevent taint checking failures
$ENV{PATH} = '/bin:/usr/bin';
delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};

use strict;
use diagnostics;

use vars qw($VERSION $ua);
$VERSION = '0.01';

require LWP::UserAgent;

###### Config ########
my $test_script_url = 'http://www.example.com:81/perl/test.pl';
my $monitor_email   = 'root@localhost';
my $restart_command = '/home/httpd/httpd_perl/bin/apachectl restart';
my $mail_program    = '/usr/lib/sendmail -t -n';
######################

$ua  = LWP::UserAgent->new;
$ua->agent("$0/watchdog " . $ua->agent);
# Uncomment the following two lines if running behind a firewall
# my $proxy = "http://www-proxy";
# $ua->proxy('http', $proxy) if $proxy;

# If it returns '1' it means that the service is alive, no need to
# continue
exit if checkurl($test_script_url);

# Houston, we have a problem.
# The server seems to be down, try to restart it.
my $status = system $restart_command;

my $message = ($status == 0)
            ? "Server was down and successfully restarted!"
            : "Server is down. Can't restart.";

my $subject = ($status == 0)
            ? "Attention! Webserver restarted"
            : "Attention! Webserver is down. can't restart";

# email the monitoring person
my $to = $monitor_email;
my $from = $monitor_email;
send_mail($from, $to, $subject, $message);

# input:  URL to check 
# output: 1 for success, 0 for failure
#######################
sub checkurl {
    my($url) = @_;

    # Fetch document 
    my $res = $ua->request(HTTP::Request->new(GET => $url));

    # Check the result status
    return 1 if $res->is_success;

    # failed
    return 0;
}

# send email about the problem 
#######################  
sub send_mail {
    my($from, $to, $subject, $messagebody) = @_;

    open MAIL, "|$mail_program"
        or die "Can't open a pipe to a $mail_program :$!\n";

    print MAIL <<__END_OF_MAIL__;
To: $to
From: $from
Subject: $subject

$messagebody

--
Your faithful watchdog

__END_OF_MAIL__

    close MAIL or die "failed to close |$mail_program: $!";
}

Of course, you may want to replace a call to sendmail with Mail::Send, Net::SMTP code, or some other preferred email-sending technique.