
                   Getting Started with Linux-HA (heartbeat)
                                       
Intro

   Let me preface this document by saying most of this is _not_ original
   work.  My purpose for writing this document is just trying to
   contribute in some way to possibly help those who REALLY get things
   done.  The "work" I am contributing is mostly compiling bits and
   pieces from other HA documents (such as Volker Wiegand's Hardware
   Installation Guide) into a document that can help novices get started
   on HA without pestering Alan (like I did!) and to cut down on repeat
   questions on the mailing list.
   
   
Getting Started

   The first thing you'll need is two computers.  You need not have
   identical hardware in both machines (or amount of memory, etc.), but
   if you did, it would make your life that much easier when a component
   fails.
   
   Now you have to decide on some of your implementation.  Your "cluster"
   is established via a "heartbeat" between the two computers (nodes)
   generated by the software package of the same name.  However, this
   heartbeat needs one or more media paths between the nodes, such as a
   serial link over a null modem cable or Ethernet over a crossover
   cable.
   
   With one such path in place (a null modem cable between the serial
   ports, in this example), you're actually ready to begin
   hardware-wise.  Of course, since you're looking into HA, you'll most
   likely want to avoid having a single point of failure.  In this
   case, that would be your null modem cable or serial port.  So, you
   need to decide whether
   you wish to add a second serial/null modem connection or a second
   network interface card (NIC) to each node connected via a crossover
   cable.  See Appendix A for instructions on how to build a crossover
   cable.  My setup goes the 2 NIC route because I only had one null
   modem cable, had plenty of NICs on hand and thought it was good to
   have two medium types for the heartbeats.
   
   Once your hardware is in order, you must install your OS and configure
   your networking (I used Red Hat).  Assuming you have 2 NICs, one
   should be configured for your "normal" network and the other as a
   private network between your clustered nodes (via the crossover
   cable).  For an example, we will assume that our cluster will have the
   following addresses:
   
   Node 1 (linuxha1):   192.168.85.1  (normal 192x net)
                        10.0.0.1 (private 10x net)
   Node 2 (linuxha2):   192.168.85.2  (192x)
                        10.0.0.2  (10x)
   Note:  None of these addresses should be your "cluster address" -
   the address handled by heartbeat and failed over between nodes!
   
   Red Hat makes this easy during installation (please don't think I'm
   carrying their banner, it's just what I use).  However, if you use
   another distribution or are having any problems, refer to the Ethernet
   HOWTO.  To check your configuration, type:
   
            ifconfig
   
   This will show your network interfaces and their configuration.  You
   can view the matching routing table in a more compact form with
   "netstat -nr".
   
   If it looks good, make sure you can ping between both nodes on all
   interfaces.
   
   Next, you need to test your serial connection.  On one node, which
   will be the receiver, type:
              cat </dev/ttyS0
   
   On the other node, type:
              echo hello >/dev/ttyS0
   
   You should see the text on the receiver node.  If it works, change
   their roles and try again.  If it doesn't, it may be as simple as
   having the wrong device file.  Volker's HA Hardware Guide and the
   Serial HOWTO are two good resources for troubleshooting your serial
   connection.
   
Installing Heartbeat

   You can now install the heartbeat package.  If you're reading this,
   you already have it, but in any case it's available at:
   
          [1]http://linux-ha.org/download
   
   Untar it into your favorite source directory.  An RPM version is
   available at the web site, or you can build your own: type "make
   rpm" and use rpm to install the result.  Otherwise, you can simply
   type "make install".
   
Configuring Heartbeat

   Configuring ha.cf
   There are three files you will need to configure before starting up
   heartbeat.  The first is ha.cf.  This will be placed in the /etc/ha.d
   directory that is created after installation.  It tells heartbeat what
   types of media paths to use and how to configure them.  The ha.cf in
   the source directory contains all the various options you can use;
   I'll go through it line by line...
   
   serial /dev/ttyS0
          Mandatory.  Replace /dev/ttyS0 with the appropriate device
          file for your serial heartbeat.
          
   watchdog /dev/watchdog
          Optional.  The watchdog function provides a way to have a
          system that is still minimally functioning, but not providing a
          heartbeat, reboot itself after a minute of being sick.  This
          could help to avoid a scenario where the machine recovers its
          heartbeat after being pronounced dead.  If that happened and a
          disk mount failed over, you could have two nodes mounting a
          disk simultaneously. If you wish to use this feature, then in
          addition to this line, you will need to load the "softdog"
          kernel module and create the actual device file.  To do this,
          first type "insmod softdog" to load the module. Then, type
          "grep misc /proc/devices" and note the number it reports
          (should be 10).  Next, type "cat /proc/misc | grep watchdog"
          and note that number (should be 130).  Now you can create the
          device file with that info typing, "mknod /dev/watchdog c 10
          130".
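
          The lookup steps above can be collected into a short shell
          sketch.  The function name watchdog_mknod_cmd is made up for
          this example.  It assumes the softdog module is already
          loaded, and it only prints the mknod command instead of
          running it, so you can check the numbers first.

```shell
# Hypothetical helper: print the mknod command for /dev/watchdog.
# $1 = devices file (normally /proc/devices)
# $2 = misc file (normally /proc/misc)
watchdog_mknod_cmd() {
    devices_file="${1:-/proc/devices}"
    misc_file="${2:-/proc/misc}"
    # Major number: the entry in the devices file named exactly "misc".
    major=$(awk '$2 == "misc" { print $1; exit }' "$devices_file")
    # Minor number: the entry in the misc file named exactly "watchdog".
    minor=$(awk '$2 == "watchdog" { print $1; exit }' "$misc_file")
    if [ -z "$major" ] || [ -z "$minor" ]; then
        echo "watchdog not found; is the softdog module loaded?" >&2
        return 1
    fi
    echo "mknod /dev/watchdog c $major $minor"
}
```

          Run "watchdog_mknod_cmd" with no arguments as root after
          "insmod softdog", compare the output with the expected 10 and
          130, then run the printed command yourself.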
          
   udp eth1
          Specifies a UDP heartbeat over the 10x eth1 interface
          (replace with eth0, eth1, or whatever you use).
          
   keepalive 2
          Sets the time between heartbeats to 2 seconds.
          
   deadtime 10
          Node is pronounced dead after 10 seconds.
          
   hopfudge 1
          Optional.  For ring topologies, number of hops allowed in
          addition to the number of nodes in the cluster.
          
   baud 19200
          Speed at which to run the serial line (bps).
          
   udpport 694
          Use port number 694 for udp. This is the default, and the
          official registered port number.
          
   nice_failback on
          Optional.  For those familiar with Tru64 Unix, heartbeat by
          default acts as if in "favored member" mode.  The master holds
          all the resources until a failover, at which time the slave
          takes over.  Once the master comes back online, it will take
          everything back from the slave.  This option will prevent the
          master node from re-acquiring cluster resources after a
          failover.
                
   node linuxha1.linux-ha.org
          Mandatory.  Hostname of machine in cluster as described by
          `uname -n`.
          
   node linuxha2.linux-ha.org
          Mandatory.  Hostname of machine in cluster as described by
          `uname -n`.
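
   Put together, the directives walked through above give a file along
   these lines.  This is only the example from this section collected
   in one place; the device names, interface and node names are the
   ones assumed throughout this document, and the optional lines
   (watchdog, hopfudge, nice_failback) can be left out.

```
# /etc/ha.d/ha.cf: sample assembled from the directives above
serial /dev/ttyS0
watchdog /dev/watchdog
udp eth1
keepalive 2
deadtime 10
hopfudge 1
baud 19200
udpport 694
nice_failback on
node linuxha1.linux-ha.org
node linuxha2.linux-ha.org
```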
          
   Configuring haresources
   Once you've got your ha.cf set up, you need to configure haresources.
   This file specifies the services for the cluster and who the default
   owner is.
   Note:  This file must be the same on both nodes!
   
   For our example, we'll assume the high availability services are
   Apache and Samba.  The IP for the cluster is mandatory, and you must
   not configure the cluster IP outside of the haresources file!  The
   haresources file will need one line:
                  linuxha1.linux-ha.org 192.168.85.3 httpd smb

   So, this line dictates that on startup, linuxha1 serves the IP
   192.168.85.3 and starts Apache and Samba as well.
   On shutdown, heartbeat will first stop smb, then Apache, then give up
   the IP.  This assumes that the command "uname -n" prints
   "linuxha1.linux-ha.org" - yours may well produce "linuxha1" and if it
   does, use that instead!
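
   A quick way to catch a hostname mismatch is to compare the output of
   "uname -n" with the first field of the haresources line.  The sketch
   below does that; the function name node_in_haresources is invented
   for this example, and treating lines that begin with "#" as comments
   is an assumption.

```shell
# Hypothetical helper: succeed if node name $1 starts a line in file $2.
node_in_haresources() {
    node="$1"
    file="${2:-/etc/ha.d/haresources}"
    # Compare the first whitespace-separated field of each line,
    # skipping lines that start with "#" (assumed to be comments).
    awk -v node="$node" '
        /^[ \t]*#/ { next }
        $1 == node { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$file"
}
```

   For example, on each node you could run:
   node_in_haresources "$(uname -n)" || echo "name mismatch"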
   
   Note:  httpd and smb are the names of the startup scripts for Apache
   and Samba, respectively.  Heartbeat will look for startup scripts of
   the same name in the following paths:
       /etc/ha.d/resource.d
       /etc/rc.d/init.d
   
   These scripts must start services via "scriptname start" and stop them
   via "scriptname stop".
   So you can use any services as long as they conform to the above
   standard.
   
   Should you need to pass arguments to a custom script, the format would
   be:
                scriptname::argument

   So, if we added a service "maid" which needed the argument "vacuum",
   our haresources line would change to the following:
                linuxha1 192.168.85.3 httpd smb maid::vacuum
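
   As a rough sketch of what such a script could look like: the sample
   log later in this document shows heartbeat running
   "/etc/ha.d/resource.d/IPaddr 192.168.85.3 start", so the skeleton
   below assumes the argument arrives first and the action last.  The
   maid service and the function name maid_main are fictional.

```shell
#!/bin/sh
# Sketch of /etc/ha.d/resource.d/maid (a fictional service).
# Assumed calling convention: maid <argument> start|stop
maid_main() {
    task="$1"
    action="$2"
    case "$action" in
        start)
            echo "maid: starting $task"   # replace with the real start command
            ;;
        stop)
            echo "maid: stopping $task"   # replace with the real stop command
            ;;
        *)
            echo "usage: maid <argument> {start|stop}" >&2
            return 1
            ;;
    esac
}

# In the installed script, the last line would hand over the arguments:
#   maid_main "$@"
```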

   This brings us to some added flexibility with the service IP address.
   We are actually using a shorthand notation above.  The actual line
   could have read (we've canned the maid):
                linuxha1 IPaddr::192.168.85.3 httpd smb

   Where IPaddr is the name of our service script, taking the argument
   192.168.85.3.  Sure enough, if you look in the directory
   /etc/ha.d/resource.d, you will find a script called IPaddr.  This
   script will also allow you to manipulate the netmask and broadcast
   address of this IP service.  To specify a subnet with 32 addresses,
   you could define the service as (leaving off the IPaddr because we
   can!):
                linuxha1 192.168.85.3/27 httpd smb

   This sets the IP service address to 192.168.85.3, the netmask to
   255.255.255.224 and the broadcast address would default to
   192.168.85.31 (which is the highest address on the subnet).  The last
   parameter you can set is the broadcast address.  To override the
   default and set it to 192.168.85.16, your entry would read:
                linuxha1 192.168.85.3/27/192.168.85.16 httpd smb
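
   If you want to double-check the netmask and default broadcast
   address for a given number of netmask bits, the arithmetic can be
   done in the shell.  This is only an illustration of the calculation,
   not what the IPaddr script itself runs; the function names cidr_info
   and dotted are made up for this example.

```shell
# Hypothetical helper: print netmask and default broadcast address.
# $1 = IP address, $2 = number of netmask bits
cidr_info() {
    ip="$1"
    bits="$2"
    # Split the dotted quad into its four octets.
    IFS=. read -r o1 o2 o3 o4 <<EOF
$ip
EOF
    addr=$(( (o1 << 24) | (o2 << 16) | (o3 << 8) | o4 ))
    # Netmask: the top $bits bits set, kept within 32 bits.
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    # Broadcast: network part of the address with all host bits set.
    bcast=$(( (addr & mask) | (mask ^ 0xFFFFFFFF) ))
    printf 'netmask %s broadcast %s\n' "$(dotted "$mask")" "$(dotted "$bcast")"
}

# Convert a 32-bit number back to dotted-quad notation.
dotted() {
    printf '%d.%d.%d.%d' \
        $(( ($1 >> 24) & 255 )) $(( ($1 >> 16) & 255 )) \
        $(( ($1 >> 8) & 255 )) $(( $1 & 255 ))
}
```

   Running "cidr_info 192.168.85.3 27" prints
   "netmask 255.255.255.224 broadcast 192.168.85.31", which matches the
   values given above.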

   You may be wondering whether any of the above is necessary for you.
   It depends.  If you've properly established a net route (independent
   of heartbeat) for the service's IP address, with the correct netmask
   and broadcast address, then no, it's not necessary for you.  However,
   this case won't fit everybody and that's why the option's there!  In
   addition, you may have more than one possible interface that could be
   used for the service IP.  Read on to see how heartbeat treats this...
   
   Once you straighten out your haresources file, copy ha.cf and
   haresources to /etc/ha.d and you're ready to start!
   
Selecting an Interface

   One important aspect of configuring the haresources file for a machine
   which has multiple ethernet interfaces is to know how heartbeat
   selects which interface will wind up supporting the service addresses
   that are configured in haresources.  After all, no interface was
   specified in the haresources file.
   
   Heartbeat decides which interface will be used by looking at the
   routing table.  It tries to select the lowest cost route to the IP
   address to be taken over.  In the case of a tie, it chooses the first
   route found.  For most configurations this means the default route
   will be least preferred.
   
   If you don't specify a netmask for the IP address in the haresources
   file, the netmask associated with the selected route will be used.
   
   Configuring Authkeys
   
   The third file to configure determines your authentication keys.
   There are three types of authentication methods available:  crc, md5,
   and sha1.  "Well, which should I use?", you ask.  Since this document
   is called "Getting Started", we'll keep it simple......
   
   If your heartbeat runs over a secure network, such as the crossover
   cable in our example, you'll want to use crc.  This is the cheapest
   method from a resources perspective.  If the network is insecure, but
   you're either not very paranoid or concerned about minimizing CPU
   resources, use md5.  Finally, if you want the best authentication
   without regard for CPU resources, use sha1.  It's the hardest to
   crack.
   
   The format of the file is as follows:
   auth <number>
   <number> <authmethod> [<authkey>]
   
   So, for sha1, a sample /etc/ha.d/authkeys could be:
   auth 1
   1 sha1 key-for-sha1-any-text-you-want
   
   For md5, you could use the same as the above, but replace "sha1" with
   "md5".
   
   Finally, for crc, a sample can be:
   auth 2
   2 crc
   
   Make sure the file's permissions are safe, such as 600.  Also, "any
   text you want" is not quite right: there's a limit to the number of
   characters you can use.
   That's it!
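
   One way to create the file with safe permissions from the start is
   sketched below.  The function name write_authkeys is invented for
   this example, and it always uses key number 1.

```shell
# Hypothetical helper: write an authkeys file readable only by its owner.
# $1 = target file, $2 = method (crc, md5 or sha1), $3 = key (omit for crc)
write_authkeys() {
    file="$1"
    method="$2"
    key="$3"
    (
        umask 077    # a newly created file is private from the start
        if [ -n "$key" ]; then
            printf 'auth 1\n1 %s %s\n' "$method" "$key" > "$file"
        else
            printf 'auth 1\n1 %s\n' "$method" > "$file"
        fi
        chmod 600 "$file"    # also covers a file that already existed
    )
}
```

   For example, as root:
   write_authkeys /etc/ha.d/authkeys sha1 "key-for-sha1-any-text-you-want"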
   
Starting and testing heartbeat

   From Red Hat, or other distributions which use the SystemV style init
   files, simply type /etc/rc.d/init.d/heartbeat start on both nodes.  I
   would recommend starting on the system master (in our example
   linuxha1) first.
   
   If you want heartbeat to run on startup, what to do depends on your
   distribution.  For Red Hat (again, sorry) and Mandrake, you will
   need to place links to the startup script in the appropriate init
   level directories.  I have heartbeat start last and only care about
   the 0(halt), 6(reboot), 3(text-only), 5(X) run levels.
   So, I needed to type in the following (as root, of course):
   
       cd /etc/rc.d/rc0.d ; ln -s ../init.d/heartbeat K01heartbeat
       cd /etc/rc.d/rc3.d ; ln -s ../init.d/heartbeat S99heartbeat
       cd /etc/rc.d/rc5.d ; ln -s ../init.d/heartbeat S99heartbeat
       cd /etc/rc.d/rc6.d ; ln -s ../init.d/heartbeat K01heartbeat
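
   The same four links can be made in a loop.  This is just a sketch of
   the commands above; the function name link_heartbeat is made up, and
   the base directory is a parameter so you can try it on a scratch
   directory first.

```shell
# Hypothetical helper: create the heartbeat start/kill links.
# $1 = rc base directory (/etc/rc.d on Red Hat in this document)
link_heartbeat() {
    rcbase="${1:-/etc/rc.d}"
    for entry in 0:K01 3:S99 5:S99 6:K01; do
        level=${entry%%:*}     # run level, e.g. 3
        prefix=${entry##*:}    # link prefix, e.g. S99
        ln -s ../init.d/heartbeat \
            "$rcbase/rc$level.d/${prefix}heartbeat" || return 1
    done
}
```

   Run "link_heartbeat" with no argument as root to act on /etc/rc.d.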
   
   The last time I ran slackware, there was no /etc/rc.d/init.d directory
   (may have changed by now) and to do the same thing, I would have
   placed in /etc/rc.d/rc.local:
       /etc/ha.d/heartbeat start
   Note:  This assumes you copy the file ha.rc to /etc/ha.d/heartbeat.
   If you can't find /etc/rc.d/init.d with your distribution and you're
   unsure of how processes start, you can use the rc.local method.  But
   you're on your own for shutdown; I just don't remember...
   
   Note:  If you use the watchdog function, you'll need to load its
   module at bootup as well.  For Red Hat, I put the following command at
   the bottom of the /etc/rc.d/rc.sysinit file:
       /sbin/insmod softdog
   For the rc.local method, just put the same line right above where you
   start heartbeat.
   
   Once you've started heartbeat, take a peek at your log file (default
   is /var/log/ha-log) before testing it.  If all is peachy, the service
   owner's log (linuxha1 in our example) should look something like this:
   heartbeat: 2000/01/18_14:26:45 info: Neither logfile nor logfacility
   found.
   heartbeat: 2000/01/18_14:26:45 info: Defaulting to /var/log/ha-log
   heartbeat: 2000/01/18_14:26:45 info: ***********************
   heartbeat: 2000/01/18_14:26:45 info: Configuration validated. Starting
   heartbeat.
   heartbeat: 2000/01/18_14:26:46 notice: Starting serial heartbeat on
   tty /dev/ttyS0
   heartbeat: 2000/01/18_14:26:46 notice: UDP heartbeat started on port
   694 interface eth1
   heartbeat: 2000/01/18_14:26:46 notice: Using watchdog device:
   /dev/watchdog
   heartbeat: 2000/01/18_14:26:46 error: Cannot open /proc/ha/.control:
   No such file or directory
   heartbeat: 2000/01/18_14:26:56 warn: node linuxha2.linux-ha.org: is
   dead
   heartbeat: 2000/01/18_14:26:56 INFO: Running /etc/ha.d/rc.d/status
   status
   heartbeat: 2000/01/18_14:26:57 info: Requesting our resources.
   heartbeat: 2000/01/18_14:26:58 INFO: Running
   /etc/ha.d/resource.d/IPaddr 192.168.85.3 status
   heartbeat: 2000/01/18_14:26:58 INFO: Running /etc/ha.d/rc.d/ip-request
   ip-request
   heartbeat: 2000/01/18_14:27:00 info: node linuxha2.linux-ha.org:
   status up
   heartbeat: 2000/01/18_14:27:00 INFO: Running /etc/ha.d/rc.d/status
   status
   heartbeat: 2000/01/18_14:27:28 Acquiring resource group:
   linuxha1.linux-ha.org 192.168.85.3 httpd smb mirror
   heartbeat: 2000/01/18_14:27:28 INFO: Running
   /etc/ha.d/resource.d/mirror  start
   heartbeat: 2000/01/18_14:27:29 INFO: Running /etc/rc.d/init.d/smb
   start
   heartbeat: 2000/01/18_14:27:30 INFO: Running /etc/rc.d/init.d/httpd
   start
   heartbeat: 2000/01/18_14:27:31 INFO: Running
   /etc/ha.d/resource.d/IPaddr 192.168.85.3 start
   heartbeat: 2000/01/18_14:27:32 INFO: ifconfig eth0:0 192.168.85.3
   netmask 255.255.255.0 broadcast 192.168.85.255
   heartbeat: 2000/01/18_14:27:32 Sending Gratuitous Arp for 192.168.85.3
   on eth0:0 [eth0]
   NOTE:  Your log may differ depending on when you started heartbeat on
   linuxha2!!!  I waited just over 10 seconds.
                   _____________________________________
   
   OK, now try to ping your cluster's IP (192.168.85.3 in the example).
   If this works, telnet to it and verify you're on linuxha1.
   Next, make sure your services are tied to the .3 address.  Bring up
   netscape and type in 192.168.85.3 for the URL.  For Samba, try to map
   the drive "\\192.168.85.3\test"  assuming you set up a share called
   "test".  See Samba docs to get that going.  As an aside, however,
   you'll want to use the "netbios name" parameter to have your Samba
   share listed under the cluster name and not the hostname of your
   cluster member!
   
   NOTE: If you can't bring up the service IP address and you get ha-log
   entries similar to this:
   
             SIOCSIFADDR: No such device
             SIOCSIFFLAGS: No such device
             SIOCSIFNETMASK: No such device
             SIOCSIFBRDADDR: No such device
             SIOCSIFFLAGS: No such device
             SIOCADDRT: No such device
     
     It may mean that you need to enable IP aliasing in your kernel
     build.  Check /usr/src/linux/.config for "CONFIG_IP_ALIAS=y".  If
     you don't have it, you'll have the line "# CONFIG_IP_ALIAS is not
     set" instead.  Rebuild your kernel with IP aliasing enabled.
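
     The check can be scripted with grep.  A minimal sketch, assuming
     your kernel source lives in /usr/src/linux; the function name
     ip_alias_enabled is made up for this example.

```shell
# Hypothetical helper: succeed if IP aliasing is enabled in a kernel .config.
# $1 = path to the .config file
ip_alias_enabled() {
    config="${1:-/usr/src/linux/.config}"
    # Only an uncommented "=y" line counts as enabled.
    grep -q '^CONFIG_IP_ALIAS=y' "$config"
}
```

     For example:
     ip_alias_enabled || echo "IP aliasing is not enabled"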
     
   If this all works, you've got availability.  Now let's see if we have
   High Availability :-)
   
   Take down linuxha1.  Kill power, kill heartbeat, whatever you have the
   stomach for, but don't just yank both the serial and eth1 heartbeat
   cables.  If you do that, you'll have services running on both nodes
   and when you re-connect the heartbeat, a bit of chaos....
   Now ping the cluster IP. Approximately 5-10 seconds later it should
   start responding again. Telnet again and verify you're on linuxha2.
   If it happens but takes more like 30 seconds, something is wrong.
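
   To put a rough number on the failover time, you can retry a probe
   command once per second and count the failures.  This is a sketch
   only; the function name wait_for_service is invented, and because a
   failed ping can itself take a second or more, the count is a lower
   bound rather than an exact time.

```shell
# Hypothetical helper: retry a command once per second until it succeeds.
# $1 = maximum number of failed attempts, remaining arguments = command.
# Prints the number of failed attempts before the first success.
wait_for_service() {
    limit="$1"
    shift
    failed=0
    until "$@" >/dev/null 2>&1; do
        failed=$((failed + 1))
        if [ "$failed" -ge "$limit" ]; then
            return 1    # gave up without a single success
        fi
        sleep 1
    done
    echo "$failed"
}
```

   Start it right after taking down linuxha1, for example:
   wait_for_service 60 ping -c 1 192.168.85.3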
   
   If you get this far, it's probably working, but you should check all
   your heartbeats, too.
   First, check your serial heartbeat.  Unplug the crossover cable from
   your eth1 NIC that you're using for your udp heartbeat.  Wait about 10
   seconds.
   Now, look at /var/log/ha-log on linuxha2 and make sure there's no line
   like this:
       1999/08/16_12:40:58 node linuxha1.linux-ha.org: is dead
   If you get that, your serial heartbeat isn't working and your second
   node is taking over.  To avoid any problems, shut down heartbeat on
   the first node, then test your null modem cable.  Run the above serial
   tests again.
   
   If your log is clean, great.  Re-connect the crossover cable.  Once
   that's done, disconnect the serial cable, wait 10 seconds and check
   the linuxha2 log again.
   If it's clean, congrats!  If not, you can check /var/log/ha-log and
   /var/log/ha-debug for more clues.
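
   Searching the log for the "is dead" message can be done with grep
   instead of by eye.  A small sketch; the function name dead_lines is
   made up, and the message text is taken from the sample line shown
   above.

```shell
# Hypothetical helper: print log lines that declare the given node dead.
# $1 = node name, $2 = log file (default /var/log/ha-log)
dead_lines() {
    node="$1"
    log="${2:-/var/log/ha-log}"
    # -F matches the text literally, so dots in the hostname are not wildcards.
    grep -F "node $node: is dead" "$log"
}
```

   On linuxha2, after pulling one heartbeat cable and waiting:
   dead_lines linuxha1.linux-ha.org && echo "other heartbeat path failed"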
   
   Appendix A - Crossover Cable Construction
   
   Your cable diagram should be as follows:
   
   Connector A   Connector B
      Pin #         Pin #
        1             3
        2             6
        3             1
        6             2
        4             7
        5             8
        7             4
        8             5
                                      
   Rev 1.1.0
   (c) 2000 Rudy Pawul
   [2]rpawul@iso-ne.com

References

   1. http://linux-ha.org/download
   2. mailto:rpawul@iso-ne.com
