Sunday, December 11, 2005
New Firewall in Production at Fourmilab
On Thursday, December 8th at about 15:45 UTC, Fourmilab's Internet connection was switched over to a new firewall configuration. This followed several short-duration live tests in which problems were identified to be fixed in offline testing on a "toy network" with a retired laptop simulating the external Internet. It seems to be in the nature of these projects that, although embarked upon with great hope and enthusiasm, as the time of cut-over to live production approaches, one is afflicted with affright and apprehension. This is particularly the case with the firewall, because once it's installed, to restore full functionality it is necessary to change the IP address of every machine on the local network, and even remembering how to do that on some of the ancient hardware in production here (for example, a Sun SPARCstation 2 with a serial number of less than 10,000 running SunOS 5.5) can be a challenge. Since one of the devices on the local network is a printer shared by all the machines, every machine's printer configuration must be changed as well. All of this means that if the new firewall collapses in some way which requires reverting to the old one, there's a lot more involved than just swapping a few cables. By late last week, I had exhausted all pretexts for further cunctation, so there was nothing for it but to take a deep breath, throw the switch, and see what happened.
The two firewalls, named XL5 and XL6 (yes, I'm a big Gerry Anderson fan, and have every episode on DVD) use the Virtual Router Redundancy Protocol (VRRP) to provide full redundancy. Firewall XL5 is designated the primary and normally responds to the virtual IP addresses of the firewall cluster. If it crashes and ceases to respond correctly to heartbeats from the backup, XL6, or if it detects a fault (for example, loss of link on one of its mission-critical network interfaces) and declares itself down, XL6 immediately takes over the virtual IP addresses and becomes primary. This isn't just a "hot spare" configuration--it's a "sizzling spare" because even while XL6 is serving as the backup, its copy of the firewall software receives continuous state synchronisation messages from XL5, so a fail-over from primary to backup firewall does not interrupt active TCP connections; this can be a real lifesaver if you're running a lengthy full backup from a server on the DMZ to a backup host on the LAN when a hiccup occurs. The propensity of the "3Con" 3CR16110-95 firewalls I'm replacing for crashing during large, high-speed data transfers among networks, combined with their dropping all TCP connections when a fail-over occurs, made full backups a nightmare, forced me to write Valve, and was the primary motivation for the present firewall migration project.
The internal network has been reconfigured into physically distinct LAN, DMZ, and external segments. The LAN and DMZ machines are given addresses on the private networks 10.1.x.x and 10.2.x.x respectively, and only servers and a few special purpose machines with specific needs have addresses visible from the outside. All other machines "hide" behind a single address with NAT, and cannot accept connections of any kind initiated from the outside.
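For readers who like to see such things spelled out, here is a minimal sketch in Python of the addressing scheme just described. It is an illustration only, not the Check Point configuration: it assumes "10.1.x.x" and "10.2.x.x" mean the /16 networks 10.1.0.0/16 and 10.2.0.0/16, and the names `segment`, `accepts_inbound`, and the sample host addresses are hypothetical, invented for the example.

```python
import ipaddress

# Assumption: the LAN and DMZ ranges are read as /16 networks.
LAN = ipaddress.ip_network("10.1.0.0/16")
DMZ = ipaddress.ip_network("10.2.0.0/16")


def segment(addr: str) -> str:
    """Return which segment an address belongs to: LAN, DMZ, or external."""
    ip = ipaddress.ip_address(addr)
    if ip in LAN:
        return "LAN"
    if ip in DMZ:
        return "DMZ"
    return "external"


def accepts_inbound(addr: str, visible: set) -> bool:
    """Decide whether an internal host may receive connections from outside.

    Only hosts explicitly given an externally visible address may; every
    other LAN or DMZ machine hides behind the shared NAT address and can
    only initiate connections outward.
    """
    return segment(addr) != "external" and addr in visible


# Hypothetical example: one DMZ server is published, one LAN host is not.
visible_hosts = {"10.2.0.10"}
print(segment("10.1.3.7"), accepts_inbound("10.1.3.7", visible_hosts))
print(segment("10.2.0.10"), accepts_inbound("10.2.0.10", visible_hosts))
```

The point of the sketch is that visibility from the outside is an explicit, per-machine exception; anything not listed is unreachable from the external network by default.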
The only devices connected to the external network are the router on the leased line, the firewalls, and an ePowerSwitch I can use in extremis to power cycle firewalls and switches when I'm off site. The above configuration sounds pretty simple, but when you combine some of the odd things which are done around here with the Byzantine complexity of the Check Point firewall software, which occasionally brings to mind Joe Costello's remark about CATIA: "I've never met a human being who would want to read 17,000 pages of documentation, and if there was, I'd kill him to get him out of the gene pool.", you end up with 18 firewall rules, 27 NAT table entries, and two months of development and pre-production testing. Apart from a few minor speed bumps, however, everything has gone smoothly so far, and I anticipate moving the server and load balancer, presently held in reserve in case a fall-back is required, to the new firewall early next week, restoring full redundancy to the site before my absence during the holidays.
Posted at December 11, 2005 20:56