High-Availability Storage with Slackware, DRBD & Pacemaker

April 27th, 2010

The Problem

The current storage system at work is unmanageably insane. Storage is split across a ton of different machines, mostly Solaris 7 & 9 with one Solaris 10 machine. There’s one hardware RAID5, two software RAID5s and several almost randomly thrown-together stripes and mirrors, all spread very widely and accessed through Samba & NFS, sometimes using autofs and NIS, and all through what is effectively a single top-level directory pointing at everything. It took a whole whiteboard just to describe the RAID5 layouts; the others are still somewhat undocumented. On top of that, we are running out of space. When Sun quoted over £4,000 to expand our available space by 500GB, I knew it was time to do something serious.

The Solution

Why, migrate to Slackware of course!

The Details

Primary Goal: Low-Cost Redundancy

Previously, all disaster-mitigation rested on expensive Sun hardware and software support contracts that didn’t even cover the network sufficiently. If something died, it would be down until Sun helped to fix it with engineers and replacement parts. So we were spending insane amounts of money, getting very little in return and leaving a lot of downtime at risk. What I aimed for in the replacement was to have redundancy built in to everything. The idea was that if something failed, redundant equipment would take over and the failed part would be replaced under warranty for as long as the warranty lasted, or at cost after that. Because the parts would all be normal PC components rather than expensive proprietary Sun equipment, replacement costs are easy to accept.

Secondary Goal: Simplification

The old system is a mess. Even the longest-serving members of staff have trouble finding where things are stored, and duplication of data is a big problem. The new system has to be simple. In terms of storage, simple means a single logical storage structure: a single root directory under which sits a well-thought-out hierarchy of directories, each with its own clear purpose. This would not only make it easier for the users, but for me as well. Administration of everything from authentication to backups would become a breeze.

The Disks

SCSI vs SATA

Most of the old equipment used SCSI disks, and it was the cost of Sun’s proprietary U320 SCSI disks that generated the £4,000 quote for 0.5TB of extra storage. With 1TB enterprise-class SATA disks now available for £100 each, it was a no-brainer: it doesn’t matter how many fail over the life of the system, it’s still not going to be worth running SCSI disks. For the same price Sun asked for 0.5TB I could buy 40TB of raw SATA storage. The only concern was performance, and with a little analysis it was easy to see that the performance of a new SATA system would be more than sufficient. The old system had a single 0.5TB hardware-RAID enclosure running U320 Sun SCSI disks that provided good I/O performance, but the rest of the system couldn’t even compare with it. The second-best systems were running software RAID5 on SPARC machines with only a gig of RAM each and minimal processing speed; needless to say, the third and fourth best machines didn’t even approach that. Decision made. If the staff were used to the awful performance of the previous systems, there’s no way I could do worse with a well-designed SATA-based system.

UPDATE 20100427:
bonnie++ proved me right. The new system is between 3 and 10 times faster than the old hardware RAID5 and up to 20 times faster than the old software RAID5s, depending on the test.
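
For anyone wanting to reproduce that kind of comparison, a bonnie++ run along these lines gives comparable numbers; the mount point and user are illustrative rather than my exact invocation:

# Benchmark the mounted storage as a non-root user (path and user are examples)
bonnie++ -d /mnt/store/benchtest -u nobody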

RAID

Since this is a storage system, redundancy revolves around how the data is stored and protected. I did a huge amount of research into storage systems and redundancy, looking at real-world examples and case studies covering everything from the expensive proprietary solutions to the cheapest open ones. RAID is central to just about every single one of them (our old system included). If the data matters, you absolutely have to have some kind of on-line redundancy, otherwise a single disk failing could put you back to your tape backups, which may or may not have succeeded. The question is: what RAID level?

RAID0 was never a consideration. There’s no redundancy.

RAID1 is too primitive for our needs. The use of space is inefficient and you need to put logical volume management on top if you want volumes any larger than your physical block devices.

RAID5 is what has been in place here for some time, and for small data sets it is a reasonable option. But for the amount of data we are now talking about handling I consider RAID5 to be tantamount to professional negligence. If one disk dies, the amount of time spent rebuilding the array is long enough that a second disk is too likely to die, irrecoverably trashing the whole array. This actually happened a few years ago and due to an extreme clusterf*ck by all involved left the company without one of its critical storage servers for 5 whole months.

With these more primitive RAID levels discounted I started to look at the serious contenders: RAID6 and RAID10.

RAID6 is a good solution to a simple problem. RAID5 fails if you lose two disks in rapid succession, which is a likely occurrence. RAID6 adds a second parity disk so that you may lose two and still be okay. Unfortunately this second parity calculation means it’s worse on performance than RAID5. NetApp have a good solution to the performance problem with their proprietary RAID6 implementation, which they call RAID-DP (double parity). But have you seen their price list?! It’s horrifying.

RAID10 is pretty good for redundancy but just doesn’t sit right with me. I could understand mirroring a stripe, but standard RAID10 is striping mirrors; it just feels wrong. There are too many cases where losing the wrong disks together will cause data loss. Where RAID5 and RAID6 don’t really care which disk is which, RAID10 does.

So, RAID levels 6 and 10 are close to what I want, but neither ticks all the boxes. More than that, neither will handle the loss of three disks. It may be unlikely, but paranoia is never a bad thing when you’re dealing with critical data, and you have to at least be able to answer the question, if only for your disaster recovery plan: what do I do if I lose three disks? If the answer is panic, then your design is flawed. In the above cases the answer is simply to revert to backups. Well, call me paranoid, but I don’t trust tape backups. They are a last resort as far as I am concerned; an afterthought if you will. They should not form part of your primary plan; they are there for when absolutely everything else fails.

So, if neither 6 nor 10 will do, what do you do? It took me a while to come up with the answer, but I did come up with it: RAID61.

RAID61 (or RAID 6+1) is what I am calling my solution. As the name suggests, it is a combination of RAID levels 6 and 1. Instead of doing it the RAID10 way and mirroring each member disk of the RAID6 array (which would be difficult to implement and risky, depending on exactly which disks died), this is the other way around: two complete RAID6 arrays, each a mirror of the other. This allows for the loss of any 5 disks at once. And if they are the right disks (RAID10 thinking), you can lose an entire RAID6 array plus two more disks from the other with no data loss and no service interruption. In terms of performance, you can expect the same performance as RAID6. It’s not quite as good as RAID10 performance for a small number of disks, but it’s definitely acceptable and could possibly beat RAID10 as the number of disks increases.

You may have noticed that the space efficiency isn’t brilliant: (N+2)*2 disks, where N is the number of disks needed to provide the accessible storage space required. Some people may not consider that viable for their environment, but when you really look at the details, you are looking at almost the same space-efficiency as RAID10 but with an order of magnitude more redundancy. It’s a judgement call which you go with, but there’s no question which one I prefer given the arguments above; especially since I’m talking about a SATA environment where raw disk cost is so low anyway.
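
To put rough numbers on that formula, here is a trivial back-of-the-envelope calculation; the disk counts are illustrative rather than the exact production layout:

#!/bin/sh
# RAID61 space efficiency: (N+2)*2 disks buy you N disks' worth of usable space
N=3                          # usable capacity needed, in whole disks (e.g. 3x1TB)
PER_ARRAY=$((N + 2))         # RAID6 adds two parity disks per array
TOTAL=$((PER_ARRAY * 2))     # the entire RAID6 array is then mirrored on the second server
echo "usable: ${N} disks, raw: ${TOTAL} disks, efficiency: $((100 * N / TOTAL))%"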

The Complete Hardware Setup

You might be wondering how I intend to implement all this. Well, this is where the design really shines.

  • Two completely independent SuperMicro servers with Xeon processors and an empty CPU slot for later instant-upgradability.
  • Slackware64-13.0 Operating System
  • 3ware 9650SE-12ML hardware RAID controller
    • OS installed on hardware RAID1 using Western Digital Velociraptor hard disks
    • Storage on RAID6 (256K stripes) using 1TB disks from Western Digital and Seagate, all sourced from different vendors to ensure different production batches
  • RAID6 synchronised over a dedicated gigabit network connection using DRBD
  • Pacemaker handling fail-over management.
  • Triple-redundant PSUs on the SuperMicro chassis
  • PSUs cross-plugged into two APC 2200VA UPSes so both UPSes support both machines.

The Resulting Redundancy

Let me just walk you through the redundancy this RAID61 set-up provides:

X Dies: Result (Data Loss, Performance Loss, Service Interruption)

  • 1 Disk: Best server primary during rebuild (None, None, None)
  • 2 Disks (same array): Best server primary during rebuild (None, None, None)
  • 2 Disks (diff arrays): Best server primary during rebuild (None, Slight, None)
  • 3+ Disks (same array): Best server primary, rebuild/resync failed (None, None, None)
  • 3 Disks (diff arrays): Both rebuild, best array primary (None, Slight, None)
  • 4 Disks (1+x): Best server primary & rebuild, rebuild/resync failed (None, Slight, None)
  • 4 Disks (2+2): Both rebuild, best is primary (None, Worse, None)
  • 5+ Disks (2+x): Best server primary & rebuild, rebuild/resync failed (None, Worse, None)
  • 6+ Disks (3+x): Data loss. Manually reconstruct & revert to backup (Yes, Complete, Yes)
  • 1 RAID controller: Best server primary. Manual intervention (None, None, None)
  • 1 Motherboard: Best server primary. Manual intervention (None, None, None)
  • <5 PSU modules: Best server primary. Manual intervention (None, None, None)
  • Other core hardware: Best server primary. Manual intervention (None, None, None)
  • 1 UPS: Replace UPS. Neither server affected (None, None, None)
  • Power outage (brown or black): Both servers supported by 2xUPS. Auto-shutdown at critical battery (None, None, Only during extended outage)
  • Armageddon / The Rapture / Alien Invasion: Pray (Yes, Complete, Yes)

The Implementation

This is the difficult part: actually doing it.

Hardware

Thankfully I have a friendly server vendor that’s good at getting whatever you ask for however you ask for it for a very respectable price. So sourcing the hardware was not too difficult. It cost a bit extra to get the disks from lots of different sources because of shipping prices, but that was expected and done successfully. Absolutely everything, UPSes included, for a shade over £7,000. Absolutely awesome. Especially considering that the same solution from NetApp would cost in excess of £25,000 (approximately, based on real quotes) and would probably require a Windows server in the middle too.

Worth noting that the 9650SE-12ML RAID cards were on the firmware from 9.5.1.1 and so couldn’t do 256K striping as it’s a relatively new feature, but they provide a downloadable ISO boot disk that lets you update the firmware quickly and easily.

Once booted I spent a little time playing with the BIOS and setting up the RAID configuration identically on both machines which was such a beautifully simple experience in comparison with other RAID BIOSes I have dealt with in the past. I really like 3ware and am very sad they’ve been bought out by LSI (whose equipment I have sadly had to deal with before).

Because the storage arrays are 3TB in size (5x1TB RAID6 with 1 hot spare), a standard MBR wouldn’t do the job, so I had to discover the wonder of GUID Partition Tables (GPT) and the fact that none of the software I normally use supports them. cfdisk, fdisk and sfdisk all fall over and die at the prospect of a GPT. For a while I found myself stuck with GNU Parted, which I really hate, but I did manage to find an fdisk clone for GPT.
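
For anyone hitting the same wall, a minimal GNU Parted session for labelling a >2TB array with GPT looks something like this; the device name is an assumption, and a GPT-capable fdisk clone such as gdisk will do the same job:

# Assumes the 3ware array appears as /dev/sdb -- check with 'parted -l' first
parted -s /dev/sdb mklabel gpt
# One big partition, leaving a little space free at the end of the disk for safety
parted -s /dev/sdb mkpart primary xfs 0% 95%
parted -s /dev/sdb print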

Software

I used PXE with NFS to boot and install Slackware64-13.0 onto one of the servers. Immediately built a custom kernel exactly as I would on any other machine. Then sbopkg added the extra few bits like htop, nload, lshw, nagios-nrpe (my own slackbuild submitted to SBo), nagios-plugins (mine too) and hddtemp (yep.. that too). A few preference modifications to the setup later and it was ready for the software specific to this setup: APC PowerChute, 3ware 3DM2, DRBD & Pacemaker.
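
For reference, pulling those extras in via sbopkg is a one-liner each once the local SlackBuilds.org tree has been synced; package names are as they appear on SBo:

sbopkg -r            # sync the local copy of the SlackBuilds.org repository
sbopkg -i htop
sbopkg -i nload
sbopkg -i lshw
sbopkg -i hddtemp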

APC PowerChute

As an application I hate PowerChute. It’s written in Java (ugh) and gives you no flexibility whatsoever, but they provide an installer and an init script and it seems to work reasonably well, so it is at least acceptable. The default install location is under /opt; moving the provided init script to /etc/rc.d/rc.PBEAgent and calling it from rc.local finished the job.

3ware 3DM2

3DM2 is very similar to PowerChute in that you have to use the provided Java installer, which is hateful but works in most cases. It didn’t in mine, of course. For some reason, even using the CLI mode, the installer would take literally 20 or 30 minutes to jump from one screen to the next. I couldn’t replicate the behaviour on my 32-bit desktop machine, and after hours of looking into the problem I simply gave up and waited for the installer to complete, which at least it did. It took all afternoon, but it worked. And just for good measure I have taken a copy of the installed files so I can re-install in seconds at a later date if necessary. I had considered doing the install from a 32-bit Slackware machine, but I think the installer may do some environment-specific configuration, so I didn’t tempt fate.

With the provided init script moved to /etc/rc.d/rc.3dm2 and called from rc.local that was another one dealt with.
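
In both cases the result is the same pattern in /etc/rc.d/rc.local; a sketch, assuming the provided init scripts accept the usual start argument:

# /etc/rc.d/rc.local additions -- start the APC and 3ware agents if present
if [ -x /etc/rc.d/rc.PBEAgent ]; then
    /etc/rc.d/rc.PBEAgent start
fi

if [ -x /etc/rc.d/rc.3dm2 ]; then
    /etc/rc.d/rc.3dm2 start
fi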

DRBD

I hadn’t looked at the DRBD installation process prior to doing it and discovered the choice of building a kernel module or patching the kernel source. It doesn’t seem there’s all that much difference between the two options, but since I like hacking the kernel and already had a custom kernel that included the 3ware-9xxx driver it seemed logical to patch the kernel and recompile; especially since kernel compilation only takes 7 minutes on this machine (-j13).

Update: Even though they still provide the capability to create a kernel patch, the kernel patching instructions were removed from the DRBD site less than two weeks ago. All the evidence I can find points to them not wanting you to build it as anything but a module, but I can’t for the life of me work out why. I believe DRBD is about to officially enter the Linux kernel, where you will have the choice of building it as a module or building it in, so why they have some requirement to build it as a module I don’t know. Even most of the tools you can run to monitor DRBD will complain that the kernel module is not loaded if it is compiled in.

To compile-in (v8.3.4):

cd /usr/src
tar -xvf ~/drbd-8.3.4.tar.gz
cd drbd-8.3.4
make clean
make KDIR=/usr/src/linux kernel-patch
cd /usr/src/linux
patch -p1 < /usr/src/drbd-8.3.4/patch-linux-drbd-8.3.4

Then either manually modify your .config file to add:

CONFIG_BLK_DEV_DRBD=y

or just `make menuconfig` and go to Device Drivers -> Block Devices, highlight “DRBD Distributed Replicated Block Device support” and press “y”. DO NOT enable DRBD tracing. You don’t need it and, last time I checked, it could cause major instability.

Then compile your kernel, install your modules, update /boot, run lilo and reboot.
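
On a stock Slackware-style setup that sequence looks roughly like this; the image names and lilo configuration are whatever your machine already uses:

cd /usr/src/linux
make -j13 bzImage modules            # build the patched kernel and its modules
make modules_install                 # installs under /lib/modules/<version>
cp arch/x86/boot/bzImage /boot/vmlinuz-drbd   # example name -- match your lilo.conf entry
cp System.map /boot/System.map-drbd
cp .config /boot/config-drbd
lilo                                 # re-run the boot loader against the new image
reboot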

You will also need to install the userland tools for DRBD:

cd /usr/src/drbd-8.3.4
make install-tools

At this point I set about starting and testing DRBD. I set up an XFS filesystem on the 3TB storage array, leaving space at the end for safety and for the DRBD internal metadata. Exactly how you do this is up to you, as your system will be different to mine, and it’s very important that you understand what you’re doing before you even start. I was a little worried to begin with because the initial synchronisation was estimating completion in approximately 3 months’ time(!!). It didn’t take long to discover I hadn’t set a sync rate, so it was being limited by the default. I changed it to 110M and 20 hours later the initial sync was complete and everything fully functional.
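
For reference, the resource definition behind all of this looks roughly like the following in /etc/drbd.conf (DRBD 8.3 syntax). The hostnames, backing partition and addresses are placeholders; only the 110M sync rate reflects what I actually set:

# /etc/drbd.conf -- sketch only; adjust hosts, disks and addresses to your setup
global { usage-count no; }

resource r0 {
  protocol C;
  syncer {
    rate 110M;                  # without this, the default rate put the sync estimate at months
  }
  on node1 {
    device    /dev/drbd0;
    disk      /dev/sda3;        # placeholder: partition on the 3ware RAID6 array
    address   192.168.10.1:7788;  # dedicated gigabit replication link
    meta-disk internal;
  }
  on node2 {
    device    /dev/drbd0;
    disk      /dev/sda3;
    address   192.168.10.2:7788;
    meta-disk internal;
  }
}

Then `drbdadm create-md r0` and `drbdadm up r0` on both nodes, followed by `drbdadm -- --overwrite-data-of-peer primary r0` on whichever node you choose as the first primary, is what kicks off that initial sync.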

UPDATE 20091109:
I have decided to give up, bend to the will of Linbit and do a module installation. Primarily because, with the modifications they’ve made to the source for v8.3.6, it’s become very much easier to make a SlackBuild out of it.

You can find my SlackBuild for v8.3.6 on SlackBuilds.Org and also here.

UPDATE 20100308:
DRBD v8.3.7 is out and has also entered the 2.6.33 kernel which is now in Slackware{,64}-current. The SlackBuild has been split into two: drbd-kernel and drbd-tools. If you are using a 2.6.33 or later kernel you only need the drbd-tools package. With an earlier kernel you need both.

Details and downloads: http://blog.tpa.me.uk/2010/03/04/drbd-8-3-7-slackbuilds/

Also submitted to SlackBuilds.Org where they should be available soon.

Pacemaker

Oh My God!

Setting up the Pacemaker stack is the hardest thing I have yet had to do in my professional career. It’s insanely complicated. It used to be reasonably simple: you’d set up Heartbeat and that was it. Now you have to install a minimum of four different components, each one completely unstable and barely documented. Your only other choice is to use the older Heartbeat stuff, which is already way past its sell-by date. What makes it even harder to understand initially is that the homepage you need for all of this is the Cluster Labs site, which concentrates on Pacemaker (whether it’s on Heartbeat or OpenAIS/Corosync), not the Linux-HA site, which concentrates on Heartbeat. Most people who are vaguely aware of previous Linux high-availability set-ups know the system as “Heartbeat”, Heartbeat being the communication core of the system which would then have Pacemaker on top for resource management. The new implementation is known as “Pacemaker”, but with the Heartbeat components replaced by OpenAIS. However, OpenAIS has been split into two projects: OpenAIS and Corosync, Corosync basically being the guts of the round-robin communication protocol and OpenAIS being some extra gubbins on top. It’s a ridiculous and insane mess and I’m confusing myself just trying to describe it.

Soo.. yeah it’s a mess.

Having said that, I shall continue to describe what I’ve done to get to a working setup. Bear in mind that each of the steps I’ve taken represents days or even weeks of playing around, clawing at brick walls trying to get information, compiling, recompiling, re-recompiling, upgrading, restarting, finding bugs, reporting bugs, upgrading around bugs, starting from scratch, modifying code to fit Slackware and rewriting configuration files so many times I can’t even remember where I started.

Note: It would appear that the intention of the developers is to only ever distribute the software via distribution-specific packages, which basically means RHEL, SUSE & Debian (the primary maintainers are all SUSE staff). Because they work exceedingly closely with those distributions and the package releases for them, they couldn’t give a toss that it’s insane when approached from a source-installation point of view.

UPDATE 20091105:
As of right now, the latest Pacemaker tip (and therefore the 1.0.6 “stable” release) will not compile on Slackware64 because of a hard-coded reference to /usr/lib in configure.ac. This has been reported, but not fixed in mercurial yet. If you need this patch, drop a comment on this page. I’m not actually expecting anyone to need the patch before the tip gets fixed.

Installation Process:

  1. Install Cluster Glue
  2. Install Cluster Resource Agents
  3. Install Corosync
  4. Install OpenAIS
  5. Install Pacemaker

In all of that, only OpenAIS and Corosync have normal release versions as you would expect, with minor revisions for features & bugfixes. Cluster Glue, Cluster Resource Agents and Pacemaker all live in a mercurial repository with absolutely no meaningful release tags. The latest version of Pacemaker, for example, is 1.0.5, but that means nothing as the officially tagged 1.0.5 is over two months old and very, very broken; not even suitable for a test environment. The only way to proceed is to use the mercurial “tip” (HEAD to you and me), which is tagged simply with a hex string. The same is true for the base Cluster packages. It’s hit or miss whether you’re going to get something that works, but it’s your only choice.

In order to ease my pain, I have put significant effort into making SlackBuilds for all of these components which are based on mercurial hex versions. The versions used in the SlackBuilds are what I’m currently running in production, but I’d recommend updating the versions to the latest tip before running them.

Note: These packages are tagged as if they have come from SlackBuilds.Org (SBo) because I hope to submit them there once I’ve set the files up exactly as they want them, and I can’t be bothered to re-tag them just to host them here. Also, they are currently set up for x86_64 and a default of 13 jobs during make. Edit the scripts and adjust to your needs.

  1. Cluster Glue
  2. Cluster Resource Agents
  3. Corosync
  4. OpenAIS
  5. Pacemaker

Install in that order as each one depends on the previous one.
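
If you are running the SlackBuilds by hand rather than via sbopkg, the whole sequence boils down to something like this; the directory and package file names are illustrative, so check what each script actually produces:

# Build and install each stack component in dependency order
for pkg in cluster-glue cluster-resource-agents corosync openais pacemaker; do
    ( cd $pkg && sh ./$pkg.SlackBuild ) && installpkg /tmp/${pkg}-*-x86_64-*_SBo.tgz || break
done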

UPDATE 20100422:
The Pacemaker stack has finally been approved by SlackBuilds.Org, so you may find the latest versions of these SlackBuilds there. I recommend sbopkg to get it all installed nice and simply.

That will get you an installed Linux-HA software stack, but you haven’t even started yet. You have to configure everything. There’s so much to learn and so much to configure I’m not going to go through it all, but I am going to give you some notes on the most important bits:

System Boot
rc.local

if [ -x /etc/rc.d/rc.logd ]; then
    /etc/rc.d/rc.logd start
fi

if [ -x /etc/rc.d/rc.corosync ]; then
    /etc/rc.d/rc.corosync start
fi

rc.local_shutdown

if [ -x /etc/rc.d/rc.corosync ]; then
    /etc/rc.d/rc.corosync stop
fi

if [ -x /etc/rc.d/rc.logd ]; then
    /etc/rc.d/rc.logd stop
fi

Enable logd & corosync startup

chmod a+x /etc/rc.d/rc.{logd,corosync}
  • Ignore rc.ldirectord and rc.openais
  • OpenAIS might have its own startup script and its own call to “aisexec” but all it really does is start corosync with some flowers around it
  • ldirectord is one of those things you don’t need unless you know you need it
  • Just ignore them both unless you know you want/need them
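
What those init scripts are starting is driven by /etc/corosync/corosync.conf. A minimal two-node sketch looks like this; the bind network and multicast details are assumptions rather than my production values, and the service stanza is what loads Pacemaker on top of corosync:

# /etc/corosync/corosync.conf -- minimal two-node sketch
totem {
    version: 2
    secauth: off
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.10.0      # network of the dedicated cluster link
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
}

logging {
    to_syslog: yes
    to_logfile: no
}

service {
    # load Pacemaker as a corosync plugin
    name: pacemaker
    ver: 0
}

aisexec {
    user: root
    group: root
}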

NFS Resources
Normally, any service you want to manage with Pacemaker should not be started by your normal system init scripts and should be left to the Resource Agent. However, the Heartbeat OCF Resource Agent for NFS (ocf:heartbeat:nfsserver) actually delegates the grunt work to your own init script. In Slackware, if rc.nfsd is executable so that the OCF RA can run it, it would normally be called at system start-up too. If you want to manage NFS via Pacemaker, you need to edit rc.M, rc.K and rc.6 in /etc/rc.d to comment out the calls to /etc/rc.d/rc.nfsd, and then make the script executable (chmod a+x /etc/rc.d/rc.nfsd).
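
Concretely, the change amounts to this (the rc.M excerpt is approximate; check the actual context in each file):

# In each of /etc/rc.d/rc.M, rc.K and rc.6, comment out the block that calls rc.nfsd, e.g.
#   if [ -x /etc/rc.d/rc.nfsd ]; then
#     /etc/rc.d/rc.nfsd start
#   fi
# Then leave the script itself executable so the OCF RA can drive it:
chmod a+x /etc/rc.d/rc.nfsd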

Unfortunately it’s not quite that simple either. OCF Resource Agents are required to return very precise exit codes and they expect precise exit codes from init scripts. They also require a monitor() or status() option in order to function correctly. To that end I’ve had to do something quite horrible to the rc.nfsd script to make it work: effectively I’ve added a status() routine, but in order to make sure it’s LSB-compliant for the OCF RA, I’ve stolen a pre-compiled checkproc binary from a 64-bit SUSE machine as it produces the exact return codes the RA expects. Here is my modified script: rc.nfsd
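
The linked script is the real thing; the shape of the addition is roughly the following sketch, where checkproc’s location and the daemon being checked are assumptions:

# Sketch of the addition to /etc/rc.d/rc.nfsd -- not a copy of the real script
nfsd_status() {
  # checkproc returns the LSB exit codes the nfsserver RA expects
  # (0 = running, 3 = not running)
  /sbin/checkproc /usr/sbin/rpc.mountd
}

# ...plus a 'status' option added to the existing case statement:
case "$1" in
'status')
  nfsd_status
  exit $?
  ;;
esac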

Yeah, I don’t like using SUSE code any more than you do, but SUSE wrote most of the HA code and they wrote checkproc and they’re designed to both conform to the exact same standards, so live with it.

It’s not over. Oh no.

The nfsserver RA also makes calls to mktemp in order to do its stuff. But whoever wrote it hardcoded the mktemp calls as /sbin/mktemp. That is not where mktemp is in Slackware; it’s in /usr/sbin/mktemp. Here’s a copy of the RA with the modifications made: /usr/lib/ocf/resource.d/zordrak/nfsserver

Notice it’s from /usr/lib/ocf/resource.d/zordrak, not /usr/lib/ocf/resource.d/heartbeat. The best way to handle local modifications to OCF RAs is to make your own provider directory for the modified copies, then call them as such from the Pacemaker configuration. For example, this OCF RA is called from my Pacemaker config as ocf:zordrak:nfsserver instead of ocf:heartbeat:nfsserver. If I upgrade in the future, I don’t have to worry about my copy getting overwritten during the upgrade.
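
Setting up such a provider directory is just a copy and an edit; something along these lines, where the sed is a one-shot illustration of the mktemp path fix described above:

# Create a local OCF provider and base it on the shipped heartbeat RA
mkdir -p /usr/lib/ocf/resource.d/zordrak
cp /usr/lib/ocf/resource.d/heartbeat/nfsserver /usr/lib/ocf/resource.d/zordrak/nfsserver
# Point the hard-coded mktemp path at where this system actually keeps mktemp (run once)
sed -i 's|/sbin/mktemp|/usr/sbin/mktemp|g' /usr/lib/ocf/resource.d/zordrak/nfsserver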

Samba Resources
Samba isn’t as hard. You just need to `chmod a-x /etc/rc.d/rc.samba` to stop it from being started and stopped by the master init scripts and then use this OCF RA (which is not provided with the code): /usr/lib/ocf/resource.d/zordrak/samba

Pacemaker Configuration
Be VERY careful with quotes when configuring Pacemaker. If you are editing the XML directly (whether using `crm configure edit` or using cibadmin to export/import), then all parameter values should be enclosed in quotes. If, however, you are modifying parameters using the crm configure (live) command line or similar, then don’t use quotes. I am assured this very bad quote-handling will get cleaned up in coming commits, but I can’t be sure if or when. If you have a problem, check whether quoting has caused it. It is for this reason I didn’t use a symbol in a STONITH IPMI reset option: depending on how you modify the config, it might not even be possible to pass the symbol, because you can’t quote or escape it but you can’t get it past the shell either. Obviously there are ways and means of achieving whatever you want to do, but this is just a warning to be very careful with configuration quotes.

My CIB
Here I give you a sanitised version of my CIB to give you an idea of the configuration I have set up and how it looks.
There is a Master/Slave resource called ms-store_drbd which handles the master-slave configuration of the DRBD resource (store_drbd). There is then a group called store_serv which is dependent upon the DRBD resource: it can only run on the same machine as DRBD and only once a DRBD node has become primary. The store_serv group consists of the filesystem on the DRBD device, a bind mount for sharing NFS state data, an IP address, NFS services and Samba services:

node node1
node node2
primitive nfs_fs ocf:zordrak:Filesystem \
        params device="/mnt/store/nfs" directory="/var/lib/nfs" options="bind" fstype="none" \
        meta target-role="Started"
primitive nfsd ocf:zordrak:nfsserver \
        params nfs_init_script="/etc/rc.d/rc.nfsd" nfs_notify_cmd="/usr/sbin/sm-notify" nfs_shared_infodir="/var/lib/nfs" nfs_ip="1.2.3.4" \
        meta target-role="Started"
primitive samba ocf:zordrak:samba \
        params smbd_enabled="1" nmbd_enabled="1" winbindd_enabled="0" smbd_bin="/usr/sbin/smbd" nmbd_bin="/usr/sbin/nmbd" smbd_pidfile="/var/run/smbd.pid" nmbd_pidfile="/var/run/nmbd.pid" testparm_bin="/usr/bin/testparm" samba_config="/etc/samba/smb.conf" \
        meta target-role="Started"
primitive st-node1 stonith:external/ipmi \
        params hostname="node1" ipaddr="1.2.3.9" userid="admin" passwd="password" interface="lanplus" \
        meta target-role="Started"
primitive st-node2 stonith:external/ipmi \
        params hostname="node2" ipaddr="1.2.3.10" userid="admin" passwd="password" interface="lanplus" \
        meta target-role="Started"
primitive store_drbd ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="59s" role="Master" timeout="30s" \
        op monitor interval="60s" role="Slave" timeout="30s" \
        meta target-role="Started"
primitive store_fs ocf:zordrak:Filesystem \
        params device="/dev/drbd0" directory="/mnt/store" fstype="xfs" \
        meta target-role="Started"
primitive store_ip ocf:zordrak:IPaddr2 \
        params ip="1.2.3.4" nic="eth0" cidr_netmask="24" \
        meta target-role="Started"
group store_serv store_fs nfs_fs store_ip nfsd samba \
        meta target-role="Started"
ms ms-store_drbd store_drbd \
        meta clone-max="2" notify="true" globally-unique="false" target-role="Started" master-max="1" master-node-max="1" clone-node-max="1"
location cli-prefer-store_serv store_serv \
        rule $id="cli-prefer-rule-store_serv" inf: #uname eq node1
location l-st-node1 st-node1 -inf: node1
location l-st-node2 st-node2 -inf: node2
colocation store_serv-on-store_drbd inf: store_serv ms-store_drbd:Master
order store_serv-after-store_drbd inf: ms-store_drbd:promote store_serv:start
property $id="cib-bootstrap-options" \
        expected-quorum-votes="2" \
        stonith-action="poweroff" \
        no-quorum-policy="ignore" \
        dc-version="1.0.5-b3bf89d0a6f62160dc7456b2afa7ed523d1b49e8" \
        cluster-infrastructure="openais"
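
Once a configuration like this is loaded, the quickest sanity checks are:

crm_mon -1           # one-shot view of node and resource status
crm configure show   # dump the live configuration in crm shell syntax
crm_verify -L -V     # check the live CIB for configuration errors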
  1. January 11th, 2010 at 17:56 | #1

    As a maintainer of Corosync and OpenAIS packages, I agree in general that the documentation needs significant work. Unfortunately there are not a lot of people interested in writing documentation. But thank you for the article, it will likely help someone else.

    Please keep in mind the sources are meant to be compiled by distribution experts. In our Corosync community (and I don’t mean the Pacemaker community) we target the source tarballs towards people that package software daily. This doesn’t mean the software can’t be built from source. It does require some good understanding of how modern software packages are put together, specifically how to use autotools effectively.

    I’m sorry you had such a terrible time putting together the packages. That is not our intent. Hopefully you can work with the Slackware maintainers to release a combined package set for Slackware.

    If you need further assistance or have suggestions for how we could make our packaging more suitable for distributor packaging, mail us at

    openais@lists.osdl.org

    Regards
    -steve

  2. January 12th, 2010 at 10:06 | #2

    @Steven Dake
    Thanks for your comments Steve, they are very welcome. For what it’s worth things have become slightly easier since I originally wrote this although obviously there still is some way to go.

    With respect to distribution specificity, you should know that Slackware lives off of its community. While the core team work very hard to provide the core OS and the many wonderful packages with which it comes, 3rd-party software is mostly provided by a strong and dedicated community of Slackers; specifically via the semi-official SlackBuilds website (http://slackbuilds.org). When 3rd-party software is provided, it is not provided as a binary package, but instead as a SlackBuild script. These scripts, at their most basic level, are simply an automation of the autotools build process with configure options specified in “The Slackware Way”, providing a package that can then be installed.

    With that in mind, given that Pacemaker, Corosync, OpenAIS, CG and CRA are not the type of applications likely to enter the core Slackware distribution, my SlackBuilds are as close to an official Slackware package as we are ever likely to see (unless alienBOB, core team member and dedicated community leader, decides to do his own).

    With respect to the difficulties I had getting to a stage where I could provide SlackBuilds to the community the main problem was simply that, at the time, the configure/build process was done (and documented as such) with custom wrapper scripts designed to work for specific distributions. Now that the stack comes, as one might expect, with a configure script ready to go to start the build process, life is a whole lot easier for source compilers and distribution packagers.

    At the same time I really found difficulty getting a complete understanding of the 5-layer software stack, the purpose of each component and the relationship between them. For example, I just about thought I had a grip on what was going on with the pre-corosync 4-layer stack when OpenAIS was split into corosync and OpenAIS at which point I found a distinct lack of information on just what each part was.

    With the slowly evolving documentation available on the ClusterLabs site and in write-ups such as this one, this is another barrier that is slowly being brought down for people entering the fray to get to grips with the software.

    If I have one salient comment to make, it is that it is very difficult to decide what versions of each part of the stack to use when putting an installation together. Being immature in places, it’s often necessary to use the mercurial tip in order to get the functionality and stability you might need. While this is absolutely expected of rapidly developing software, when there are potentially 5 different repositories and you need to put your flag in the ground somewhere with each one, not knowing precisely how code changes in one might affect the functionality of another, it’s really just a matter of guesswork and hoping you picked right. Pacemaker-1.0.5, for example, lasted 3 months as the stable version, but in all honesty after about a month was completely obsolete and, if memory serves, incompatible with changes made all over the stack – so anyone coming to the project in October had a confusing time picking which version of each part of the stack to use in order to get even a basic testing setup going.

    Having said all that, I would be remiss not to thank you and the team for your hard work on the project; it is all very greatly appreciated.

  3. March 2nd, 2010 at 10:20 | #3

    Thanks a lot for this guide. I don’t know what I’d have done without it. I’d certainly be messing around, tearing my hair out still.
    I’m building a failover pair of servers that act as NFS storage and also as reverse proxying Apache servers that serve static content and load balance to a number of Tomcat servers.
    I’d started on the project without coming across this blog, and I’d managed to compile all the various stacks and even get corosync working, but after that the Pacemaker stuff just wouldn’t work. I suspect it’s because I didn’t compile the OpenAIS package before that it didn’t work; the Pacemaker site suggests that as corosync is a cut-down version of it, you don’t need to.

    I tried again from scratch using your slackbuilds and managed to get it working. It still took days of messing around with the CIB configuration for it to actually failover. One thing I found is that the CIB I created based on yours failed over when putting the master node into standby, but didn’t failover when I simulated a crash.

    Like you, I also found the documentation to be poor (and sometimes misleading). I never really found a good overview of all the components, and mostly it just gives you example CIB configs and says ‘adapt this for your purposes’ without explaining all the various bits.
    The best documentation I found is on the Novell site at: http://www.novell.com/documentation/sles11/book_sleha/?page=/documentation/sles11/book_sleha/data/sec_ha_manual_config_create.html but of course, that doesn’t help you compile it all for Slackware.

    A final thing; you really do need the SuSE checkproc command for the nfs script to work. I found the source code for this at http://ftp.nluug.nl/os/Linux/distr/pardusrepo/sources/killproc-2.11.tar.gz It’s part of the killproc package.

    Once again, thanks for sharing your trials with the rest of us – it’s an invaluable aid.

  4. lazyadmin
    April 7th, 2010 at 19:47 | #4

    Thanks for the guide. I had trouble setting up the samba resource and your guide helped a bunch.

  5. Nicholas
    April 27th, 2010 at 23:03 | #5

    Question, since you were already a Solaris shop, why not look at an OpenSolaris or Solaris x86 / ZFS / NFS -based platform? I’m pretty sure there’s a way to build the same thing without needing to go to a completely different OS.

  6. April 27th, 2010 at 23:20 | #6

    @Nicholas

    There are a number of reasons:
    1. I personally find Solaris to be a vile and hateful operating system.
    2. There was nothing keeping me on Solaris, there was no benefit whatsoever in retaining it.
    3. The only reason it was in use in the first place was my predecessors’ dependence upon ridiculously over-priced Sun hardware and the ubiquitous hardware and software support contracts that come with them.
    4. Slackware is beautiful, secure, stable and simple (don’t get me evangelising) and the rest of the core network services have been running it for years.

    (Sorry I keep modifying this comment, I can’t make up my mind how I want to say it.)

  7. Paris Stefas
    May 13th, 2010 at 12:46 | #7

    After also spending a big amount of time on the linux-ha topic, one thing is for sure, for me at least: heartbeat after 2.0.x became unnecessarily complex, and in version 3 it is now an undocumented nightmare.

    It’s a shame that the development group, instead of building on top of the original architecture, managed to split heartbeat into 1000 pieces; the new functionality should have been designed as add-on functionality rather than cancelling all the documentation and effort made up to the point at which heartbeat was USEFUL for MANY people.

    Currently it is good only for its developers.. it’s a pity..

    If it weren’t for people like the author of this article, who lose days or weeks of their lives to understand it and then try to share their work in a human way, the BEST alternative would be to use OLD heartbeat and hope you don’t meet bugs.

  8. Stuart Duncan
    August 2nd, 2010 at 11:21 | #8

    You are an absolute lifesaver – this is the only workable SAMBA script I’ve found

  9. August 2nd, 2010 at 11:24 | #9

    @Stuart Duncan
    Glad to be of service.

  10. Marco Carlo Spada
    November 17th, 2010 at 16:43 | #10

    Hi everybody,

    I’ve just discovered this blog and I felt down immediately.

    I’m trying to compile the whole HA suite under Slackware64 (13.1) and I’ve been working hard at it for a few weeks. I fully agree with you when you say that “…Setting up the pacemaker stack is the hardest thing I have yet had to do in my professional career…”.

    Well, I’m now blocked at the Pacemaker ./configure stage because it complains about the Heartbeat utility libraries.

    ======
    configure: error: in `/usr/src/Pacemaker-1-0-Pacemaker-1.0.10′:
    configure: error: Core Heartbeat utility libraries not found: no
    ======

    I tried the 1.0.10 and 1.0.5 versions and the latest tip release downloaded from Clusterlabs, all with the same exit error.

    I installed in the order you show:

    1) Cluster Glue
    2) Cluster Resource Agents
    3) Corosync

    avoiding openAIS because they say it’s only needed with multi-access filesystems (that I’m not planning to use).

    Can you suggest where I have to modify this f*** autoconf?

    thanks in advance

    marco

    • November 17th, 2010 at 17:05 | #11

      @Marco Carlo Spada

      It has been quite a while since I last worked on this and I’m not familiar with any recent developments; however, my primary observation is that (last time I looked at it) you needed OpenAIS if you want Corosync, because there are a few components in OpenAIS that Corosync needs. They used to be the same thing, but were then split into two.
      I can’t say that authoritatively, but it’s what I would do.

      Also, it sounds silly, but I want to make sure that you have actually installed the cluster-glue and resource-agents packages into the system before building Corosync, and that you haven’t only built them without installing them. It’s an easy mistake to make.

      Other than that my best recommendation is that you head to the IRC channel to see if they can point you in the right direction: #linux-ha on FreeNode.

  11. Martin Fox
    February 7th, 2011 at 07:09 | #12

    @Stuart Duncan
    +1
    Many thanks, Martin from Switzerland

  12. Jon
    March 3rd, 2013 at 23:49 | #13

    @Steven Dake
    “Please keep in mind the sources are meant to be compiled by distribution experts”
    You’re kidding, right? I am a distribution expert and even I have trouble keeping up with this mess. I think the maintainers are less than experts else they would have created a solution that was not so fragile and a pain in the ass to get up and running.

