Curtis,

Here is a first draft of the specification you asked for.

Feedback is appreciated.  I'm not expecting any more major design changes
from here on out, but if something comes up I'll let you know.

Brian

Contents:

 1 - intro
 2 - features
 3 - architecture overview
 4 - startup semantics
 5 - dynamic configuration
 6 - things still vaguely defined
 7 - stuff left to code

---------
1 - INTRO
---------

The initial goal of this (as yet unnamed) project was to replace resonate
with free software so that we could actually sell some clusters :)

This goal still holds and I am working towards it.

------------
2 - FEATURES
------------

So the software is to have the following features:

 - level 4 load balancing
 - director node failover
 - service level monitoring
 - dynamic configuration and administration through web gui

In more detail:

Wensong Zhang's lvs kernel code is used to perform the actual level 4 switching.
This is at the heart of the system.

Each linux director node has a backup node specified and heartbeat takes care
of performing the actual failover.

Each linux director also runs a copy of mon that monitors its realservers.
Realservers are removed from service when their service goes down.

Dynamic configuration and administration are to be provided by a web based
gui in conjunction with the lvsd daemon which runs on each host.

The basic features that the gui is to have are the following:

 - add/remove nodes from cluster 
 - add/remove vips to cluster
 - add/remove realservers to vips
 - monitor system performance

----------------
3 - ARCHITECTURE
----------------

The cluster is architected as follows:

Heartbeat is used as the lowest level software interface to the cluster.

Heartbeat provides low level clustering (llc) services such as node
membership, interface status, and reliable cluster communications.

All cluster control communication is passed through heartbeat.

I'm using a hacked copy of heartbeat that supports a simple active, backup
paradigm for service failover.

The only restrictions that this places on the assignment of vips to the cluster
are that there can only be one backup node (this would be easy to fix later)
and that vips with the same ip but different ports need to have the same
primary and backup directors.

Vips running on a director can fail over to different backups only if they
have different ip addresses.
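
The same-ip rule above could be checked when the config is loaded.  What
follows is only a sketch: the function name and the dict layout (ip, port,
primary, backup) are made up for illustration and are not how lvsd stores
its vips.

```python
from collections import defaultdict


def conflicting_vip_ips(vips):
    """Return the ips whose vips do not all share one (primary, backup) pair."""
    pairs_by_ip = defaultdict(set)
    for vip in vips:
        pairs_by_ip[vip["ip"]].add((vip["primary"], vip["backup"]))
    # More than one distinct pair for an ip breaks the restriction.
    return sorted(ip for ip, pairs in pairs_by_ip.items() if len(pairs) > 1)


# Hypothetical example data: two ports on one ip, plus a second ip.
GOOD_VIPS = [
    {"ip": "192.168.0.100", "port": 80, "primary": "node1", "backup": "node2"},
    {"ip": "192.168.0.100", "port": 443, "primary": "node1", "backup": "node2"},
    {"ip": "192.168.0.101", "port": 80, "primary": "node1", "backup": "node3"},
]

# Same as above, but port 443 names a different backup director.
BAD_VIPS = GOOD_VIPS[:1] + [
    {"ip": "192.168.0.100", "port": 443, "primary": "node1", "backup": "node3"},
]
```

A non-empty result would be a reason to reject the config before anything
is handed to heartbeat.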

Each machine also runs the lvsd.  Lvsd is the daemon responsible for running
the other software components as well as for calling ipvsadm to change 
the scheduling rules in the kernel.

Each linux director runs a copy of mon which monitors the services running
on the realservers allocated to the vip.  On monitor failure, mon calls an 
alert which causes lvsd to remove the node in question from service until
it comes back up again. 

In addition there will be a gui interface daemon which will talk both on a 
TCP socket to the machine which serves up the web gui, and to heartbeat.  
This daemon can run on one or all of the machines in the cluster.

This gui interface daemon then talks to mod_perl running within apache 
somewhere.  It is this machine which serves up the actual client interface.

It is intended that you can run one or more of the gui interfaces and also 
have any number of concurrent administrators without problems.

---------------------
4 - STARTUP SEMANTICS
---------------------

The cluster starts up as follows:

On each machine, lvsd starts first.  It then reads the config file
/etc/lvsd.conf.  If the config files are not the same across the cluster
there will be an error later.

The config file (currently) contains the following directives:

LogDir         - where to write the lvsd.log
DigestKey      - The key to use when computing message digests
HeartbeatPort  - The udp port for heartbeat communications
HeartbeatIface - The ethernet interface to use for heartbeat 

There will likely be more options for specifying heartbeat communication
mechanisms in the future.

In addition there are the Node and Vip directives:

Nodes are specified like this:

Node node1 192.168.0.1 

Vips are specified like this:

Vip {
  ...
}

Vips accept the following directives:

PrimaryDirector - The name of the director node
BackupDirector  - The name of the backup director node
Scheduling      - The type of scheduling to use.  Uses the same abbreviations
                  as ipvsadm - i.e. $1 in ip_vs_([a-z]+).o
ServiceMonitor  - The name of a mon monitor that can be used to verify
                  realserver operation
Persistent      - Should clients always hit the same backend server?  
Port            - The port of the vip
ServiceIP       - The ip of the vip
Realserver      - The name of a node to be realserver for this vip
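
To make the file layout concrete, here is a rough sketch of a reader for it.
The exact syntax is an assumption: one directive per line, words separated
by whitespace, Realserver repeated once per realserver, and no comment
syntax.  All values in the sample (node names, addresses, the monitor name,
the Persistent value) are invented for illustration.

```python
def parse_lvsd_conf(text):
    """Parse lvsd.conf-style text into options, nodes and vips (sketch)."""
    config = {"options": {}, "nodes": {}, "vips": []}
    current_vip = None  # dict while inside a Vip { ... } block
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if current_vip is not None:
            if line == "}":
                config["vips"].append(current_vip)
                current_vip = None
                continue
            key, value = (line.split(None, 1) + [""])[:2]
            if key == "Realserver":
                current_vip["Realserver"].append(value.strip())
            else:
                current_vip[key] = value.strip()
            continue
        words = line.split()
        if words == ["Vip", "{"]:
            current_vip = {"Realserver": []}
        elif words[0] == "Node" and len(words) == 3:
            config["nodes"][words[1]] = words[2]
        else:
            config["options"][words[0]] = " ".join(words[1:])
    if current_vip is not None:
        raise ValueError("unterminated Vip block")
    return config


# Hypothetical sample file; none of these values come from a real cluster.
SAMPLE = """
LogDir /var/log
HeartbeatPort 694
HeartbeatIface eth0
Node node1 192.168.0.1
Node node2 192.168.0.2
Node web1 192.168.0.11
Vip {
  PrimaryDirector node1
  BackupDirector node2
  Scheduling rr
  ServiceMonitor http.monitor
  Persistent no
  Port 80
  ServiceIP 192.168.0.100
  Realserver web1
}
"""
```

Calling parse_lvsd_conf(SAMPLE) gives one dict with the top level options,
a name-to-ip map of nodes, and a list with one entry per Vip block.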


Once the config file is read by each lvsd, config files for heartbeat are
written.  These are /etc/ha.d/ha.cf and /etc/ha.d/haresources.
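
As a rough illustration of this step, the sketch below renders both files
from already parsed settings.  It assumes the stock heartbeat directives
(udpport, bcast, node) and the stock haresources shorthand of "owner-node
ip-address".  The hacked heartbeat's way of recording the backup director is
not shown here, and the function names and dict layout are hypothetical.

```python
def render_ha_cf(port, iface, node_names):
    """Render a minimal ha.cf: udp port, broadcast interface, node list."""
    lines = [f"udpport {port}", f"bcast {iface}"]
    lines += [f"node {name}" for name in node_names]
    return "\n".join(lines) + "\n"


def render_haresources(vips):
    """Render one haresources line per distinct vip ip.

    Vips that share an ip are assumed to share a primary director,
    so the first one seen for each ip decides the owner.
    """
    owner_by_ip = {}
    for vip in vips:
        owner_by_ip.setdefault(vip["ip"], vip["primary"])
    return "".join(f"{owner} {ip}\n" for ip, owner in sorted(owner_by_ip.items()))
```

Grouping by ip is what ties this to the earlier restriction: heartbeat
moves ip addresses around, not ip/port pairs.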

Then heartbeat is started through /etc/rc.d/init.d/heartbeat start.

Heartbeat will then establish communications across the cluster and decide 
what resources should reside where.  For example, in the event that the
primary director node for a vip doesn't come up, heartbeat will bring the
vip up on the backup.

The actual execution of ipvsadm is handled by the lvsd upon receipt of the
proper message from heartbeat.
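
For illustration, the ipvsadm invocations involved might be built like this.
The sketch uses the option letters of current ipvsadm releases (-A, -a, -d,
-t, -r, -s, -g, -w), which may differ from the ipvsadm version lvsd is
written against.  The helper names are hypothetical, and the commands are
only built here, not executed.

```python
def add_service(vip_ip, port, scheduler):
    """Command to create a TCP virtual service with the given scheduler."""
    return ["ipvsadm", "-A", "-t", f"{vip_ip}:{port}", "-s", scheduler]


def add_realserver(vip_ip, port, rs_ip, weight=1):
    """Command to attach a realserver using direct routing (-g)."""
    return ["ipvsadm", "-a", "-t", f"{vip_ip}:{port}",
            "-r", f"{rs_ip}:{port}", "-g", "-w", str(weight)]


def remove_realserver(vip_ip, port, rs_ip):
    """Command to detach a realserver from the virtual service."""
    return ["ipvsadm", "-d", "-t", f"{vip_ip}:{port}", "-r", f"{rs_ip}:{port}"]
```

Each list could be handed to subprocess.run() on the director that currently
owns the vip.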

After the vip is created in the kernel routing tables, config files for mon
are written into a /tmp/.$$ directory and mon is started to monitor the 
services on that vip.  

Mon enables all servers by default, and then disables them when the monitor
for that server fails.  This exposes us to a small window in which the vip
could give a refused connection, if the cluster is brought up when there are
failed servers.

Mon will continue to monitor a failed server in order to put it back in
service when it comes back up.  
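
The enable-by-default behaviour can be pictured as a small state table kept
per vip.  This is a sketch only: the class and method names are invented,
and how mon's alerts actually reach lvsd is not modelled.

```python
class RealserverStates:
    """Track which realservers of one vip are currently in service."""

    def __init__(self, realservers):
        # Every realserver starts out enabled, before any monitor has run.
        self.in_service = {name: True for name in realservers}

    def monitor_failed(self, name):
        """Return "remove" if the server must be taken out, else None."""
        if self.in_service.get(name):
            self.in_service[name] = False
            return "remove"
        return None

    def monitor_recovered(self, name):
        """Return "add" if the server must be put back, else None."""
        if name in self.in_service and not self.in_service[name]:
            self.in_service[name] = True
            return "add"
        return None
```

Because every server starts as enabled, a server that was already dead at
startup stays in rotation until its first failed check, which is the window
described above.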

In the event that there is more than one vip associated with a director, 
when the trap to add the second vip comes in there will already be a copy
of mon running so the thing to do is to add the watches via the mon client
interface so that we can keep only one copy of mon running at a time.  

Also at this point it would be good to have a check to make sure there
aren't any discrepant configs.  After that check we can allow for 
control messages to alter the config of the cluster.
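
One possible shape for that check is for each lvsd to compute a keyed digest
of its config and compare digests across the cluster.  Everything here is an
assumption: reusing DigestKey for this purpose, the choice of HMAC-SHA256,
the whitespace normalisation, and the function names.

```python
import hashlib
import hmac


def config_digest(conf_text, digest_key):
    """Keyed digest of a config, ignoring blank lines and extra whitespace."""
    lines = [" ".join(line.split()) for line in conf_text.splitlines() if line.strip()]
    canonical = "\n".join(lines).encode("utf-8")
    return hmac.new(digest_key.encode("utf-8"), canonical, hashlib.sha256).hexdigest()


def discrepant_nodes(local_digest, reported):
    """Names of nodes whose reported digest differs from the local one."""
    return sorted(
        name for name, digest in reported.items()
        if not hmac.compare_digest(digest, local_digest)
    )
```

Nodes returned by discrepant_nodes() would have to be fixed, or kept from
applying control messages, before the config is opened up for changes.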
 
At this point in time the cluster is hopefully up and running, despite any
failures that may have occurred.  This is where the real fun begins :)

------------------
5 - DYNAMIC CONFIG
------------------

So now that the system is functional, we can connect to it with the client
and see what's up.

As written above, the client webserver talks to a special client daemon
sitting somewhere in the cluster.  This daemon then pushes control messages
out through heartbeat which in turn are picked up by the appropriate lvsd
processes running on the cluster, which make the necessary changes.

The gui will be responsible for presenting the following information:

 - nodes in the cluster
 - vips in the cluster
 - realservers for each vip
 - performance stats for each vip (number of connections)
 - status of each node in the cluster (up/down) (from heartbeat)
 - status of each realserver (service up/down) (from mon)

In addition, the gui will allow for these three types of changes to the
state of the system:

 1 - add/remove node from cluster
 2 - add/remove/modify vip from cluster
 3 - add/remove/modify realserver from vip

We'll look at each of these in turn:

 1 - At the top level, the lvsd.conf file contains a list of all nodes in 
     the system.  This list is then passed to heartbeat.  All nodes that  
     claim to be part of the cluster must be listed in this config file.  
     Otherwise they are not allowed to participate.
 
     So when adding a new node to the system, the heartbeat config file 
     must be modified and we need to run /etc/rc.d/init.d/heartbeat reload,
     which should restart heartbeat but *keep* resources so that we can 
     add a node without interrupting existing services.

     Removing a node is handled similarly, with care taken to deal with 
     resources that were allocated to the node in question.

 2 - Just as there is a nodelist stored in /etc/ha.d/ha.cf, there is a list
     of resources (vips in our case) stored in /etc/ha.d/haresources.

     This list of resources needs to be updated in the case that we add or
     remove a vip.  Then we need to pull the same reload trick as in #1.
     I will need to make sure that the boundary conditions are handled 
     properly here, specifically in the case of when you tell heartbeat to 
     reload, and the new config says nothing about some existing vip.
     Hopefully heartbeat does the right thing.  If not I'm sure I can
     make it do the right thing :)

     The only vip modification that can be done online (besides realserver 
     stuff) is changing the scheduler, which doesn't require a heartbeat 
     restart. 

     If the user wants to change the ip or port of the vip it is easier to
     have them drop the vip and create a new one as service will be 
     interrupted anyway.

     Everything else is #3:

 3 - If the user wants to change what realservers are associated with a vip
     then we need to notify mon of the changed monitoring requirements.

     This is easy because mon has a client interface designed to do just that.

     So when we get a message to add a realserver then we will tell mon 
     to start monitoring an additional host, and if we get a message to 
     drop one, we will similarly stop monitoring it. 

     This is easy because we are running exactly one mon process per director 
     node, and that process runs on the director node.  This makes it easy
     to failover both mon and the lvs routing at the same time.

     The only realserver modification that we can make in place is to 
     change the weight of the realserver.  It is possible that we could
     support multiple forwarding methods in the future as well.

     For now I have hard coded the direct routing method, which is very
     similar to the ip tunneling method except that the extra ip header 
     is skipped because the machines are assumed to be on the same subnet.

     This could be changed easily.
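
The two in-place changes mentioned above (scheduler for a vip, weight for a
realserver) map onto ipvsadm's edit operations.  The sketch below assumes
the option letters of current ipvsadm releases (-E to edit a service, -e to
edit a realserver); the helper names are hypothetical and the commands are
only built, not run.

```python
def change_scheduler(vip_ip, port, scheduler):
    """Command to switch the scheduler of an existing TCP virtual service."""
    return ["ipvsadm", "-E", "-t", f"{vip_ip}:{port}", "-s", scheduler]


def change_weight(vip_ip, port, rs_ip, weight):
    """Command to change a realserver's weight, keeping direct routing (-g)."""
    return ["ipvsadm", "-e", "-t", f"{vip_ip}:{port}",
            "-r", f"{rs_ip}:{port}", "-g", "-w", str(weight)]
```

Neither command touches heartbeat or mon, which is why these two changes
can be made without a reload.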
 
 
-----------------------------
6 - STUFF TO FIGURE OUT STILL     
-----------------------------

There are some pieces that haven't really been fully spec'd out here.

These include the protocol spoken between the gui daemon and the web server,
the protocol spoken by the lvsd, and the actual layout of the web gui.

----------------------
7 - STUFF LEFT TO CODE
----------------------

The lvsd is half written at this point.  

I'm in the middle of rewriting the message library it uses to communicate
via heartbeat.  

Right now it does the whole startup procedure but uses its old udp port
for the different traps.  This needs to be changed to do everything through
heartbeat.

After that, I need to write the gui daemon and start hooking up the client
and adding the dynamic config stuff.

I'm hoping that in a few weeks I'll have most of that done.
