Sunday, December 11, 2005

New Firewall in Production at Fourmilab

On Thursday, December 8th at about 15:45 UTC, Fourmilab's Internet connection was switched over to a new firewall configuration. This followed several short-duration live tests in which problems were identified to be fixed in offline testing on a "toy network" with a retired laptop simulating the external Internet. It seems to be in the nature of these projects that, although embarked upon with great hope and enthusiasm, as the time of cut-over to live production approaches, one is afflicted with affright and apprehension. This is particularly the case with the firewall, because once it's installed, to restore full functionality it is necessary to change the IP address of every machine on the local network, and even remembering how to do that on some of the ancient hardware in production here (for example, a Sun SPARCstation 2 with a serial number of less than 10,000 running SunOS 5.5) can be a challenge. Since one of the devices on the local network is a printer shared by all the machines, every machine's printer configuration must be changed as well. All of this means that if the new firewall collapses in some way which requires reverting to the old one, there's a lot more involved than just swapping a few cables. By late last week, I had exhausted all pretexts for further cunctation, so there was nothing for it but to take a deep breath, throw the switch, and see what happened.

[Photo: new firewall test configuration]

Here is the configuration as it presently stands, or rather sits on the floor in front of the communications rack to permit easy switching of patch cables back and forth. The new firewall and the hardware used to build the toy network for testing it sit atop the cardboard box. At the bottom of the stack are the twin Nokia IP265 security appliances, mounted side by side in a single 1U rack chassis. The IP265 runs an operating system derived from FreeBSD which Nokia calls IPSO, under which the Check Point VPN-1 NG firewall software runs. The Nokia boxes use flash memory instead of a hard drive; the only moving part is the fan. Each has four 10/100 full-duplex Ethernet interfaces, permitting them to support physically separated external, LAN (inside), and DMZ (server farm) networks, while using the fourth network interface for heartbeat and state synchronisation between the active and backup firewalls. Atop the firewalls are the network hubs and switches--mostly retired or spare gear--that were used to set up the toy networks for testing and are now connected to production machines by the cables running up to the patch panel. An old wireless access point on the floor to the right of the box permitted testing of its interaction with the DHCP server running on the firewalls. The laptop on the table simulated a server on the DMZ during the test period. Visible on its screen is a spreadsheet which I've been using to monitor mean HTTP server hits per second on an hourly basis for a week before and after (cells with a blue background) cutover to the new firewall.
(There has been a modest decline in hit rate over this period which has nothing to do with the new firewall: I have been deploying increasingly effective versions of Gardol to detect and block packets from the ever growing tsunami of referrer pollution attacks, and dumping the packets from these bozos into the bit bucket before they hit the HTTP server reduces the hit rate computed from its log file.)
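
As a rough illustration of the hourly figure described above, here is a minimal Python sketch that derives mean hits per second for each hour from an HTTP server log. It assumes the log is in Common Log Format with timestamps like `[08/Dec/2005:15:45:01 +0000]`; the function name and the log format are assumptions made for this example, not a description of how the spreadsheet was actually filled in.

```python
from collections import Counter
from datetime import datetime


def mean_hits_per_second(log_lines):
    """Return {hour_start: mean hits per second} for Common Log Format lines."""
    hits = Counter()
    for line in log_lines:
        # The timestamp sits between the first '[' and the following ']'.
        start = line.find("[")
        end = line.find("]", start)
        if start < 0 or end < 0:
            continue  # skip lines that do not look like Common Log Format
        stamp = datetime.strptime(line[start + 1:end], "%d/%b/%Y:%H:%M:%S %z")
        # Truncate to the start of the hour to form the bucket key.
        hits[stamp.replace(minute=0, second=0)] += 1
    # An hour has 3600 seconds, so the mean rate is hits / 3600.
    return {hour: count / 3600 for hour, count in sorted(hits.items())}
```

Because requests dropped before they reach the HTTP server never appear in its log, a rate computed this way falls as more unwanted traffic is blocked upstream, which is the effect noted in the parenthesis above.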

The two firewalls, named XL5 and XL6 (yes, I'm a big Gerry Anderson fan, and have every episode on DVD) use the Virtual Router Redundancy Protocol (VRRP) to provide full redundancy. Firewall XL5 is designated the primary and normally responds to the virtual IP addresses of the firewall cluster. If it crashes and ceases to respond correctly to heartbeats from the backup, XL6, or if it detects a fault (for example, loss of link on one of its mission-critical network interfaces) and declares itself down, XL6 immediately takes over the virtual IP addresses and becomes primary. This isn't just a "hot spare" configuration--it's a "sizzling spare" because even while XL6 is serving as the backup, its copy of the firewall software receives continuous state synchronisation messages from XL5, so a fail-over from primary to backup firewall does not interrupt active TCP connections; this can be a real lifesaver if you're running a lengthy full backup from a server on the DMZ to a backup host on the LAN when a hiccup occurs. The propensity of the "3Con" 3CR16110-95 firewalls I'm replacing for crashing during large, high-speed data transfers among networks, combined with their dropping all TCP connections when a fail-over occurs, made full backups a nightmare, forced me to write Valve, and was the primary motivation for the present firewall migration project.
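
The fail-over behaviour described above can be sketched as a toy model in Python. This is only an illustration of the idea (highest-priority healthy node owns the virtual addresses, and the connection table is mirrored to the backup), not the VRRP wire protocol or Check Point's state synchronisation; the class names, the priority values, and the example connection tuple are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Firewall:
    name: str
    priority: int                      # higher value wins the election
    healthy: bool = True
    connections: set = field(default_factory=set)


class Cluster:
    def __init__(self, nodes):
        self.nodes = nodes

    def master(self):
        """The healthy node with the highest priority owns the virtual IPs."""
        candidates = [n for n in self.nodes if n.healthy]
        return max(candidates, key=lambda n: n.priority) if candidates else None

    def open_connection(self, conn):
        """Record a connection on the master and mirror it to the backups."""
        active = self.master()
        if active is None:
            raise RuntimeError("no healthy firewall in the cluster")
        active.connections.add(conn)
        for node in self.nodes:
            if node is not active and node.healthy:
                # State synchronisation: the backup holds a full copy.
                node.connections = set(active.connections)


xl5 = Firewall("XL5", priority=200)    # hypothetical priorities
xl6 = Firewall("XL6", priority=100)
cluster = Cluster([xl5, xl6])
cluster.open_connection(("10.2.0.5", 22, "10.1.0.9", 40000))
xl5.healthy = False                    # XL5 fails or declares itself down
new_master = cluster.master()          # XL6, which already knows the connection
```

In this model the backup already holds the connection table at the moment it takes over, which is why an established TCP session can survive the switch.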

The internal network has been reconfigured into physically distinct LAN, DMZ, and external segments. The LAN and DMZ machines are given addresses on the private networks 10.1.x.x and 10.2.x.x respectively, and only servers and a few special purpose machines with specific needs have addresses visible from the outside. All other machines "hide" behind a single address with NAT, and cannot accept connections of any kind initiated from the outside. The only devices connected to the external network are the router on the leased line, the firewalls, and an ePowerSwitch I can use in extremis to power cycle firewalls and switches when I'm off site.
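
The addressing plan can be expressed compactly with Python's standard `ipaddress` module. This sketch assumes that "10.1.x.x" and "10.2.x.x" correspond to the networks 10.1.0.0/16 and 10.2.0.0/16, and treats any other address as external; the function name and the sample addresses are hypothetical.

```python
import ipaddress

# Assumed mapping of the private ranges named in the text to segments.
SEGMENTS = {
    "LAN": ipaddress.ip_network("10.1.0.0/16"),
    "DMZ": ipaddress.ip_network("10.2.0.0/16"),
}


def segment_of(address):
    """Return the segment name for an IPv4 address, or "external"."""
    ip = ipaddress.ip_address(address)
    for name, network in SEGMENTS.items():
        if ip in network:
            return name
    return "external"


print(segment_of("10.1.3.7"))    # a hypothetical LAN workstation
print(segment_of("10.2.0.20"))   # a hypothetical DMZ server
```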

The above configuration sounds pretty simple, but when you combine some of the odd things which are done around here with the Byzantine complexity of the Check Point firewall software, you end up with 18 firewall rules, 27 NAT table entries, and two months of development and pre-production testing. (The software occasionally brings to mind Joe Costello's remark about CATIA: "I've never met a human being who would want to read 17,000 pages of documentation, and if there was, I'd kill him to get him out of the gene pool.") Apart from a few minor speed bumps, however, everything has gone smoothly so far. The server and load balancer are presently held in reserve in case a fall-back is required; I anticipate moving them to the new firewall early next week, restoring full redundancy to the site before my absence during the holidays.

Posted at December 11, 2005 20:56