Simple IP Failover

Thursday, August 13th, 2009

At work, we have a few virtual machines that are part of some sort of cluster.  Some are active/active and some are active/passive.  Some are load balancers and some are web servers.  Using clustering or IP failover for high availability is great: it's much easier to update nodes one at a time without having to schedule downtime or cause noticeable impact to the end user.  In the past I've been using the Linux-HA software.  It's full-featured but very complicated.

Recently, I've been working on moving one of our services from Redhat Enterprise Linux based virtual machines to Ubuntu Linux.  Redhat has always served us pretty well, but some of the project requirements included newer versions of software than what was available in the latest release of the distribution.  These requirements were met by the latest long-term support release of Ubuntu, which is now over a year old.  I appreciate the timely release schedule that Ubuntu uses as well as the inclusion of the latest versions of various packages.  But I'm getting off topic here.  I was using the Linux-HA software with Redhat to run an active/active cluster.  Each cluster node handled HTTP requests directly (one hostname had a couple of IP addresses associated with it).  This worked well, but the Linux-HA software wasn't fun to manage.  I didn't use any front-end tools; I just edited the XML files and loaded them.  My other complaint was that requests were not evenly balanced across the nodes with the DNS round-robin approach.

So the new implementation now has redundant backend workers (running Tomcat), with a single Apache load balancer on the front end.  The Apache load balancer works as a reverse proxy and gracefully handles conditions where workers stop responding.  The load is appropriately dispersed between the workers and I am extremely pleased with the results.

But there is one problem: the Apache load balancer isn't highly available.  I didn't want to set up the Linux-HA software again, so I started looking around for a simpler solution (think KISS).  I soon found this blog post and it was exactly what I was looking for.

After reading the article, I decided to write a Perl script that uses lock files and daemonizes itself, instead of a shell script.  I had just written another script that worked that way and was very happy with it.  After putting together the script and doing some testing, I decided I needed two scripts: one for the active node and one for the standby node.  The basic idea is that the standby node checks the active node to see that it is up and running.  If it detects a failure, it brings up the service IP address, sends the ARP packet, and restarts the Apache daemon (so that it sees all IP addresses).  When the standby node detects that the primary is back on the network, it shuts down the service IP address.  The primary node checks the default gateway to determine whether it is up on the network.  If it detects a failure, it shuts down the service IP address.  When it regains network connectivity, it adds the service IP, sends the ARP packet, and restarts the Apache service.
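The standby's decision logic described above can be sketched as a small pure function, separate from the commands it would trigger.  This is only an illustration under stated assumptions: the name next_step, the action labels, and the threshold of two consecutive missed pings are hypothetical choices for the sketch, not taken from the original scripts.

```perl
use strict;
use warnings;

# Hypothetical helper: given the number of consecutive missed pings so far
# and the result of the latest ping, return the new count and an action.
# Actions: 'takeover' (bring up service IP, send ARP, restart Apache),
#          'release'  (shut down the service IP), or 'none'.
sub next_step {
    my ($missed, $peer_alive, $threshold) = @_;

    if (!$peer_alive) {
        $missed++;
        # Act once, at the moment the threshold is reached.
        return ($missed, $missed == $threshold ? 'takeover' : 'none');
    }

    # Peer answered: release only if we had previously taken over.
    my $action = $missed >= $threshold ? 'release' : 'none';
    return (0, $action);
}

# Walk through a made-up sequence of ping results (1 = reply, 0 = no reply).
my $missed = 0;
for my $alive (1, 0, 0, 0, 1, 1) {
    my $action;
    ($missed, $action) = next_step($missed, $alive, 2);
    print "alive=$alive missed=$missed action=$action\n";
}
```

Keeping the decision separate from the ifconfig, send_arp, and apache2ctl calls makes it possible to check the takeover and release transitions without touching a real interface.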

So the two scripts are below: first the one that runs on the primary, and second the one that runs on the standby node.  I hope to use these with few modifications for all our applications.  Some have multiple IP addresses for IP-based virtual hosting (SSL sites).  I also installed the fake package to get the send_arp program.  I will probably need to make some more revisions, but I thought this might be helpful to other people out there trying to accomplish the same thing.  These scripts come "as-is" with no warranty whatsoever.  Feel free to do what you'd like with them.  If anyone has better solutions, feel free to post a comment!

EDIT (19-August-2009): I had some issues with a race condition in one of the scripts.  The script that brought the IP address up on the primary node had an issue with the ping counter.  I decided to run just one script, on the standby node, to simplify things.  On the primary, I scheduled a cron job to run the send_arp command every minute.  This sends an ARP packet to update the tables on the router whenever the primary is online.  I've also made some slight modifications to the script that runs on the backup host.
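A minimal sketch of what the cron-driven announcement on the primary might look like, written in Perl to match the rest of this post.  The file name announce_arp.pl, the install path in the crontab comment, and the address values are assumptions for illustration only; the send_arp argument order simply mirrors the call made in ipfaild.pl.

```perl
#!/usr/bin/perl
# announce_arp.pl (hypothetical name)
# Example crontab entry, assuming the script is installed at this path:
#   * * * * * /usr/local/sbin/announce_arp.pl
use strict;
use warnings;

my $send_arp    = '/usr/sbin/send_arp';    # same path as in ipfaild.pl
my $this_mac    = '00:11:22:33:44:55';     # placeholder MAC of the primary
my @service_ips = ('192.168.1.200');       # placeholder service IP(s)

# Build the argument list in the same order ipfaild.pl uses:
# source IP, source MAC, target IP, broadcast MAC.
sub send_arp_args {
    my ($ip, $mac) = @_;
    return ($ip, $mac, $ip, 'ff:ff:ff:ff:ff:ff');
}

# Only attempt the announcement when the binary is actually installed.
if (-x $send_arp) {
    for my $ip (@service_ips) {
        system($send_arp, send_arp_args($ip, $this_mac)) == 0
            or warn "send_arp failed for $ip: $?\n";
    }
}
```

Passing the arguments to system as a list, as ipfaild.pl does, avoids going through a shell.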

EDIT (28-March-2010): I've taken down my git repository and I'm including the script below.

#!/usr/bin/perl -w
#
# ipfaild.pl
#
# Daemon to handle IP failover and restart necessary services.
#
# 11-Aug-2009 - Patrick Hennessy
#
use strict;
use Sys::Syslog;
use POSIX qw(setsid);
use Fcntl ':flock';
use Net::Ping::External qw(ping);

# Vars
#
my $pid;
my $progname = "ipfaild";
my $daemon_pidfile = "/var/run/$progname/$progname.pid";
my $daemon_lockfile = "/var/run/$progname/$progname.lock";
my $log_facility = "LOG_DAEMON";
my $ifconfig = '/sbin/ifconfig';
my $send_arp = '/usr/sbin/send_arp';
my $apache2ctl = '/usr/sbin/apache2ctl';
my $ping_timeout = 1;
my $sleep_time = 2;
my $missed = 0;
my $ipRec;

my $otherHost = 'otherhost.domain.com';
my $thisMAC = '00:11:22:33:44:55';

my @ipRecords = (
        { name=> 'fooservice', pubip => '192.168.1.200', dev => 'eth0:10', mask => '255.255.255.0' },
);

# Subroutines
#
sub daemonize;

# Daemonize process
#
daemonize;

# Acquire exclusive lock
#
open LOCKFILE, ">$daemon_lockfile" or die "$progname: can't write to $daemon_lockfile: $!\n";
flock(LOCKFILE, LOCK_EX | LOCK_NB) or die "$progname: can't acquire lock: $daemon_lockfile: $!\n";
# We are in the daemonized child here, so $$ is the daemon's own PID.
print LOCKFILE "$$\n";

# Open syslog.
#
openlog($progname, "pid", $log_facility);

# Signal handlers
#
my $keep_processing = 1;
$SIG{HUP}  = sub { syslog("info", "Caught SIGHUP:  exiting gracefully"); $keep_processing = 0; };
$SIG{INT}  = sub { syslog("info", "Caught SIGINT:  exiting gracefully"); $keep_processing = 0; };
$SIG{QUIT}  = sub { syslog("info", "Caught SIGQUIT:  exiting gracefully"); $keep_processing = 0; };
$SIG{TERM}  = sub { syslog("info", "Caught SIGTERM:  exiting gracefully"); $keep_processing = 0; };

# Bring down interfaces.
#
for $ipRec (@ipRecords) {
        syslog("info", "Running: $ifconfig $ipRec->{'dev'} down");
        system($ifconfig, $ipRec->{'dev'}, 'down') == 0
                or syslog("info", "Error: $? Could not run: $ifconfig $ipRec->{'dev'} down");
}

# Main loop
#
while ($keep_processing) {
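        # Ping the other host and count consecutive dropped packets
        #
        if (! ping(host => $otherHost, timeout => $ping_timeout)) {
                $missed++;
        } else {
                # The other host answers again.  If we had taken over the
                # service IPs (which happens when $missed reaches 2), release them.
                if ($missed >= 2) {
                        for $ipRec (@ipRecords) {
                                syslog("info", "Running: $ifconfig $ipRec->{'dev'} down");
                                system($ifconfig, $ipRec->{'dev'}, 'down') == 0
                                        or syslog("info", "Error: $? Could not run: $ifconfig $ipRec->{'dev'} down");
                        }
                }
                # Reset on every successful ping so that a stale count cannot
                # re-trigger the takeover block below while the other host is up.
                $missed = 0;
        }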

        # Bring up IP addresses if packets dropped
        #
        if ($missed == 2) {
                for $ipRec (@ipRecords) {
                        syslog("info", "Running: $ifconfig $ipRec->{'dev'} $ipRec->{'pubip'} netmask $ipRec->{'mask'}");
                        system($ifconfig, $ipRec->{'dev'}, $ipRec->{'pubip'}, 'netmask', $ipRec->{'mask'}) == 0
                                or syslog("info", "Error: $? Could not run: $ifconfig $ipRec->{'dev'} $ipRec->{'pubip'} netmask $ipRec->{'mask'}");
                        syslog("info", "Running: $send_arp $ipRec->{'pubip'} $thisMAC $ipRec->{'pubip'} ff:ff:ff:ff:ff:ff");
                        system($send_arp, $ipRec->{'pubip'}, $thisMAC, $ipRec->{'pubip'}, 'ff:ff:ff:ff:ff:ff') == 0
                                or syslog("info", "Error: $? Could not run: $send_arp $ipRec->{'pubip'} $thisMAC $ipRec->{'pubip'} ff:ff:ff:ff:ff:ff");
                }
                syslog("info", "Running: $apache2ctl restart");
                system($apache2ctl, 'restart') == 0
                        or syslog("info", "Error: $? Could not run: $apache2ctl restart");
        }

        # Sleep
        #
        sleep($sleep_time);
}

# Bring down interfaces.
#
for $ipRec (@ipRecords) {
        syslog("info", "Running: $ifconfig $ipRec->{'dev'} down");
        system($ifconfig, $ipRec->{'dev'}, 'down') == 0
                or syslog("info", "Error: $? Could not run: $ifconfig $ipRec->{'dev'} down");
}

# Close syslog.
#
closelog();

# Close lockfile.
close (LOCKFILE);

# Exit.
#
exit(0);

# Functions
#
sub daemonize {
        open STDIN, '/dev/null' or die "$progname: can't read /dev/null: $!";
        open STDOUT, '>/dev/null' or die "$progname: can't write to /dev/null: $!";
        defined(my $pid = fork) or die "$progname: can't fork: $!";
        if($pid) {
                # parent: record the child's PID, then exit
                open PIDFILE, ">$daemon_pidfile" or die "$progname: can't write to $daemon_pidfile: $!\n";
                print PIDFILE "$pid\n";
                close(PIDFILE);
                exit;
        }
        # child: start a new session to detach from the controlling terminal
        setsid or die "$progname: can't start a new session: $!";
        open STDERR, '>&STDOUT' or die "$progname: can't dup stdout: $!";
}
