Monday, December 31, 2007
Fourmilab Server Farm: One Year Uptime
On the last day of 2007, the Fourmilab server farm reached a milestone: every machine which provides public services (Web, FTP, HotBits, etc.) has now run for one year or more without a reboot or other system-wide service outage. Thanks to the dual redundant power supplies of the Dell PowerEdge 1850 principal servers, an Uninterruptible Power Source (UPS) which failed to live up to its name could be swapped out without shutting down the servers to which it provided partial power.
Host name | Function | Uptime as of 2007-12-31
server1 | Active public server | 365 days, 10:31 hours
server0 | Backup public server | 722 days, 17:03 hours
server3 | Test/administration server | 378 days, 12:57 hours
hotbits0 | HotBits generator 0 | 463 days, 10:12 hours
hotbits1 | HotBits generator 1 | 428 days, 22:42 hours
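The milestone can be checked mechanically from the figures in the table above. What follows is only a minimal sketch in Python: it assumes that "one year" means 365 days and that the uptime strings always have the form shown in the table; the names `UPTIMES`, `parse_uptime`, and `all_up_for` are made up for this illustration.

```python
import re
from datetime import timedelta

# Uptime figures copied from the table above (days, then hours:minutes).
UPTIMES = {
    "server1": "365 days, 10:31 hours",
    "server0": "722 days, 17:03 hours",
    "server3": "378 days, 12:57 hours",
    "hotbits0": "463 days, 10:12 hours",
    "hotbits1": "428 days, 22:42 hours",
}

# Assumption: "one year" is taken to be 365 days.
ONE_YEAR = timedelta(days=365)


def parse_uptime(text: str) -> timedelta:
    """Convert a string such as '365 days, 10:31 hours' to a timedelta."""
    match = re.fullmatch(r"(\d+) days, (\d+):(\d+) hours", text)
    if match is None:
        raise ValueError(f"unrecognised uptime format: {text!r}")
    days, hours, minutes = (int(field) for field in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes)


def all_up_for(uptimes: dict[str, str], threshold: timedelta) -> bool:
    """Return True if every host has been up for at least `threshold`."""
    return all(parse_uptime(value) >= threshold for value in uptimes.values())


if __name__ == "__main__":
    print("All hosts up for a year or more:", all_up_for(UPTIMES, ONE_YEAR))
```

The host with the shortest uptime, server1, is the one that determines when the milestone is reached.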
Some people berate sites which rack up lengthy uptime records, claiming that this indicates neglect of preventive software maintenance, in particular keeping systems up to “current patch level”. Now, this is largely an instance of intellectual corruption due to Microsoft, where updating a music player requires rebooting a running system, but some Linux users also assume that frequent kernel updates, and the reboots needed to install them, are essential for a secure system. Fourmilab's philosophy is different: on server farm machines, essentially the only component from the Linux software distribution used in the critical path is the kernel. Everything else (Web, FTP, mail, DNS, and other servers) is built from source which resides in the server's private “/server” partition and, in keeping with the Unix tradition, any of these components can be updated as required simply by restarting it; no system reboot is required.
When a security or other update to one of the public server packages is released, I build it from source and test it on server3, the “Test/administration server”, which is actually a six-year-old laptop with a software configuration identical to the production servers. After testing, the update is deployed on the active and backup production servers with rdist, then put into production by restarting the server process on these machines; the interruption to public requests due to such a restart is on the order of one second. I generally install server updates on the active server first and leave the previous version on the backup server until I'm confident the new release is working well. That way, should the update crash or otherwise become nonresponsive under the real-world load, the load balancer will automatically fail over to the previous version running on the backup server.
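The staged rollout described above can be modelled in a few lines. This is only a sketch of the idea in Python, not the actual Fourmilab tooling: the `Server` class, the `route`, `install_on_active`, and `promote_to_backup` functions, and the version strings are all hypothetical, and the real load balancer's health checking is reduced to a single `healthy` flag.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    version: str          # version of the server package currently running
    healthy: bool = True  # stand-in for the load balancer's health check


def route(active: Server, backup: Server) -> Server:
    """Pick the server that receives requests: active unless it is unresponsive."""
    return active if active.healthy else backup


def install_on_active(active: Server, new_version: str) -> None:
    """Step 1: update only the active server; the backup keeps the old release."""
    active.version = new_version


def promote_to_backup(active: Server, backup: Server) -> None:
    """Step 2: once the new release has proven itself, update the backup too."""
    backup.version = active.version


if __name__ == "__main__":
    # Hypothetical version numbers, for illustration only.
    active = Server("server1", "1.0")
    backup = Server("server0", "1.0")

    install_on_active(active, "1.1")
    print("Serving:", route(active, backup).version)

    # Simulate the new release hanging under real-world load.
    active.healthy = False
    print("After failover:", route(active, backup).version)
```

Keeping the two servers on different versions for a while is what makes the failover useful: a fault specific to the new release cannot take out both machines at once.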
The Fourmilab firewall is configured to allow packets from the Internet to reach server farm machines only on the ports on which these locally-built server processes listen; all other incoming traffic is discarded, so potentially vulnerable components from the Linux distribution, even if they were listening on some port, cannot be accessed from off-site by would-be attackers.
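The policy amounts to an allowlist with a default of "discard". A minimal sketch of that decision in Python follows; it is not the firewall's real configuration. The port numbers are assumptions, namely the conventional well-known ports for Web, FTP, mail, and DNS, since the ports actually opened are not listed here, and the names `ALLOWED_PORTS` and `admit` are invented for the example.

```python
# Assumed ports for the locally-built servers (conventional well-known ports;
# the ports actually opened at Fourmilab are not stated in the text).
ALLOWED_PORTS = frozenset({
    21,  # FTP
    25,  # mail (SMTP)
    53,  # DNS
    80,  # Web (HTTP)
})


def admit(destination_port: int, allowed: frozenset[int] = ALLOWED_PORTS) -> bool:
    """Default-deny rule: pass a packet only if its port is on the allowlist."""
    return destination_port in allowed


if __name__ == "__main__":
    # Port 22 stands in for some distribution-supplied service that happens
    # to be listening but is not on the allowlist.
    for port in (80, 22):
        print(port, "accepted" if admit(port) else "discarded")
```

Because anything not named is discarded, a distribution-supplied daemon that happens to be listening is unreachable from off-site without any rule having been written for it.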
Posted at December 31, 2007 16:45