Tuesday, December 12, 2006
Applying Check Point FireWall-1 Hotfixes on a Nokia IP265 Network Appliance
(Note: The information in this item is so specialised it is probable that not a single regular reader of this chronicle will find it of interest. Why post it then? Because every time I publish such an item I receive feedback from people who found it with a search engine and who write to thank me for pointing them to the solution of the obscure problem in question. That's why I try to give such items long titles with keywords to direct those searching for such information to them.)

Last year at about this time Fourmilab's firewall was upgraded to dual redundant diskless Nokia IP265 network appliances which run Check Point FireWall-1 software. The two Nokia machines are configured as an active/backup high availability cluster using the Virtual Router Redundancy Protocol (VRRP) so that if the active firewall fails, the backup, which constantly monitors its status and mirrors connection information, can take over without even dropping active TCP connections.

All of this worked pretty much as expected, but unfortunately I soon discovered a horrific bungle in the VRRP fail-over implementation. When the active firewall went down, the backup took over, then relinquished control back to the primary unit once it came back up: all well and good. But if you rebooted the backup, the active firewall would cease to forward traffic until the backup returned to service! This meant your entire “high availability” cluster, and access to all of the machines behind it, was vulnerable to the failure of the backup firewall. What a mess!

After researching this problem for some time, I discovered that Check Point had issued a “hot fix” to correct a problem in which the reboot of a VRRP backup machine would send a bogus gratuitous ARP packet which “could block cluster connectivity”; that certainly sounded like the problem I was having.
Check Point periodically releases what they call “Hotfix Accumulators” which are like Sun's omnibus “rollup patches” for Solaris: a large collection of independent patches said to be mutually compatible which, together, constitute a minor release of the software to which they pertain. I downloaded the current such package, which was released in June of 2006 (although its documentation was most recently revised in mid-September 2006) and proceeded to try to install it first on the backup firewall, allowing the primary to remain in production so as not to disrupt access to the site.

Because the Nokia IP265 is a flash-based diskless machine, its file system structure is rather curious. It runs an operating system called IPSO, which is based on FreeBSD, and the output of the “df -k” command is as follows:

    xl5[admin]# df -k
    Filesystem  1K-blocks    Used   Avail Capacity  Mounted on
    /dev/wd0f      127151   40525   76454    35%    /
    v9fs           120224   41752   78472    35%    /image/IPSO-3.8.1-BUILD028-12.02.2004-222502-1518/rfs
    /dev/wd0a       31775      33   29200     0%    /config
    /dev/wd0h      317903  154047  138424    53%    /preserve
    procfs              4       4       0   100%    /proc
    v9fs            88712   10240   78472    12%    /var
    mfs:83           7607       0    6998     0%    /var/tmp2/upgrade
    v9fs           174448   95976   78472    55%    /opt

The file systems on the various partitions of “/dev/wd0” are stored in the non-volatile flash memory. At boot time, they are decompressed and copied into the RAM file systems from which the system runs; changes to the “v9fs” file systems are lost at the next reboot.

The Hotfix Accumulator I wished to install was 22.8 MB GZIP compressed. I decided to copy it to the volatile “/var” file system, with about 80 MB of free space, for installation. I copied it, decompressed and extracted the archive, and still had what I thought was plenty of free space on /var. How wrong I was.
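For anyone who would rather compare the candidate file systems mechanically than squint at the listing, the “df -k” output above can be parsed with a few lines of code. What follows is only a minimal Python sketch, not part of the original installation: it assumes the listing has been pasted onto some machine where Python is available (IPSO itself may well not have it), and the names `Mount`, `parse_df`, and `roomiest` are invented for illustration.

```python
from typing import List, NamedTuple

# The listing from the article, minus the shell prompt line.
DF_OUTPUT = """\
Filesystem 1K-blocks Used Avail Capacity Mounted on
/dev/wd0f 127151 40525 76454 35% /
v9fs 120224 41752 78472 35% /image/IPSO-3.8.1-BUILD028-12.02.2004-222502-1518/rfs
/dev/wd0a 31775 33 29200 0% /config
/dev/wd0h 317903 154047 138424 53% /preserve
procfs 4 4 0 100% /proc
v9fs 88712 10240 78472 12% /var
mfs:83 7607 0 6998 0% /var/tmp2/upgrade
v9fs 174448 95976 78472 55% /opt
"""


class Mount(NamedTuple):
    """One row of df output (hypothetical helper type)."""
    device: str
    total_kb: int
    used_kb: int
    avail_kb: int
    mount_point: str


def parse_df(text: str) -> List[Mount]:
    """Parse `df -k` style output, skipping the header line."""
    mounts = []
    for line in text.strip().splitlines()[1:]:
        # Six columns; the mount point is last and is kept whole.
        device, total, used, avail, _capacity, mount_point = line.split(None, 5)
        mounts.append(Mount(device, int(total), int(used), int(avail), mount_point))
    return mounts


def roomiest(mounts: List[Mount]) -> Mount:
    """Return the file system with the most available space."""
    return max(mounts, key=lambda m: m.avail_kb)


best = roomiest(parse_df(DF_OUTPUT))
print(best.mount_point, best.avail_kb)  # → /preserve 138424
```

Available space is only a starting point for the decision: the archive size says little about how much intermediate storage the installer will consume.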
The process of installing one of these updates uses a huge amount of intermediate storage, and the installation script does not bother to check whether there's enough space to complete the task before commencing it. Worse, when it does exhaust the free space on the file system, it just keeps blundering on, truncating files and destroying information, and then, as a final boot in the system administrator's face, reports that everything has completed successfully. When you reboot the firewall after this process, you begin to appreciate the extent of the damage. Essentially nothing works; your installation consists of about an equal mix of old, new, and truncated files, and since the backup files were lost in the disc full incident, you cannot even reverse the process to restore the status quo ante.

After surveying the wreckage, I decided the most expeditious course would be to restore the entire contents of the /preserve/opt/packages/installed directory from the most recent backup. (If you haven't previously installed any patches, you can restore it from the “IPSO Wrapper” for the software version you're using.) After restoring the contents of this directory, I was able to reboot the backup firewall and have it resume its backup rôle running the old version.

For the next attempt, I decided to place the update files on the /preserve file system, which is the largest on the machine. Copying them there filled this file system to the 50% level from a starting point of 37%, but that still left far more free space than on /var. My first attempt to install the patch from there failed, claiming that it was already installed. The earlier botched installation had corrupted the registry (shudder) and so I had to edit /preserve/var/opt/CPshared-R55p/registry/HKLM_registry.data and remove the two instances of:
    : (HotFixes :HOTFIX_HFA_R55P_08 (1) )

left there by the failed installation before the update would apply successfully. During the installation I kept an eye on free space, and at the high-water mark /preserve reached the white-knuckle level of 87% of capacity. But after the installation was complete, it dropped back to just a few percent above where it started. I then rebooted the backup firewall, halted the primary, and allowed the new version of the software on the backup to enter production. After a day with no problems, I repeated the process to install the update on the primary, restoring the site to a fully redundant configuration.

After all of this I was almost afraid to try the obvious test of rebooting the backup to see if the problem which launched me on this adventure had, in fact, been fixed; I'm not sure I could have maintained my composure if all of this had been for nought. But, after all, system administrators are known more for their ill-tempered meat cleavers than even-tempered composure, so I went ahead and rebooted the backup and, lo and behold, the problem had indeed been fixed (cue choirs of angels singing hallelujahs).

The lesson to take away from this is that when you're installing Check Point Hotfix Accumulators on a Nokia IP265, always place the update directory on the /preserve file system, not one of the others with less capacity. And, of course, be sure you have a complete, current backup of the entire machine (not just the configuration files backed up by Nokia Voyager, which do not include the critical package files modified by the Hotfix installation) before attempting the installation.
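Keeping an eye on free space during an installation like this can also be automated. The sketch below is a hypothetical Python helper, not something used in the installation described here: the names `percent_used` and `high_water_mark` are invented for this example, and Python is assumed to be available, which may not be true on IPSO itself. The percentage is computed as used / (used + available), which is how “df” derives its Capacity column. The readings replayed in the usage example are illustrative: 37, 50, and 87 come from the account above, while 41 merely stands in for “a few percent above where it started”.

```python
import shutil
import time
from typing import Callable


def percent_used(path: str) -> float:
    """Capacity as df reports it: used / (used + available), in percent."""
    usage = shutil.disk_usage(path)
    denominator = usage.used + usage.free
    return 100.0 * usage.used / denominator if denominator else 0.0


def high_water_mark(sample: Callable[[], float], samples: int,
                    interval_s: float = 0.0) -> float:
    """Call sample() `samples` times, pausing between calls; return the peak."""
    peak = 0.0
    for i in range(samples):
        peak = max(peak, sample())
        if i < samples - 1:
            time.sleep(interval_s)
    return peak


# Replay of illustrative readings rather than a live file system.
readings = iter([37.0, 50.0, 87.0, 41.0])
print(high_water_mark(lambda: next(readings), samples=4))  # → 87.0

# Against a real file system one might instead pass, for example:
#   high_water_mark(lambda: percent_used("/preserve"), samples=600, interval_s=1.0)
```

Passing the sampling function in as a parameter keeps the peak-tracking logic testable without touching a real file system.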
Posted at December 12, 2006 21:03