Wednesday, June 29, 2005

Single Point Failures

Having spent more than six months transitioning Fourmilab over to a server farm architecture and, in the process, trying to eliminate as many single points of failure as possible, I'm even more than usually sensitive to unexpected single point failures in the real world. Last week, Switzerland experienced a doozy. If there's one thing you expect about Switzerland, it's that the trains run on time and, in fact, they almost always do. Last Wednesday, June 22nd, however, Switzerland experienced a nation-wide railroad failure which, for about three hours, brought every train in the country to a standstill. Now many people, including me, never even remotely imagined that such a thing could happen, but it did. Apparently, the combination of a lightning strike, two parallel circuits being taken offline for maintenance, and the general star architecture of the 15,000 volt main lines contributed to an outage which never could have happened in the age of steam. Here is an article in English with an early look at the circumstances--a comprehensive investigation is underway. Naturally, when this happened, I had a visitor from the other side of the country who, like the 200,000 people en route when the trains ground to a halt, was stranded until the system came back up. Fortunately, he was able to catch the last train that night after power was restored. The very next day, a shorter outage brought the trains to a halt in this region, but not all across the country.

Violent thunderstorms are an almost daily occurrence this time of year, and their impact on electrical and telecommunication infrastructure is something one simply must ride out, but it's kind of startling that a single event can bring all the trains to a halt. It's as unimaginable as Swissair going bankrupt . . . oh, right. Anyway, last Monday, June 27th, another lightning strike took down all the electricity in the Canton of Neuchâtel. A strike on a 125,000 volt line (bottom headline) turned out the lights for the entire Canton, including Fourmilab. Power was restored to some regions within about 40 minutes, but the outage here lasted more than an hour. The Fourmilab Web site was degraded, but did not go down--while the main servers shut down, the "pathfinder" server, a re-purposed laptop which can run about three hours on its batteries, took over as a hot spare, and the network gear and load balancers, which can run for hours on their UPSes, continued to route requests to that machine. Since it is a 1 GHz single processor machine, it was hideously overloaded and response was very slow, but it did continue to run and serviced DNS and mail requests normally until power was restored. I didn't intend the site to be able to ride out an hour-long power outage, but it's nice to know it can.
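
For the curious, the hot-spare behaviour described above can be sketched in a few lines of Python. This is only an illustration under assumed names, not the actual load balancer configuration at Fourmilab: the Backend class, the pick_backends function, and the host names server0 and server1 are all hypothetical; only "pathfinder" comes from the account above.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str            # host name (hypothetical except "pathfinder")
    up: bool = True      # result of the most recent health check
    spare: bool = False  # True for a hot spare used only as a fallback


def pick_backends(backends):
    """Return the backends that should receive requests right now."""
    # Prefer the main servers whenever at least one of them is up.
    primaries = [b for b in backends if b.up and not b.spare]
    if primaries:
        return primaries
    # No main server is up: fall back to any hot spare still running.
    return [b for b in backends if b.up and b.spare]


# Assumed layout: two main servers plus the battery-backed spare.
farm = [
    Backend("server0"),
    Backend("server1"),
    Backend("pathfinder", spare=True),
]
```

With both main servers marked down, pick_backends(farm) returns only the pathfinder entry, which is the degraded-but-alive state the site was in during the outage. If the spare's batteries ran out too, the list would be empty and the site would be down.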

I suppose the lesson from all this for practical engineering is that if you can't imagine a single point failure in a system, this may be due to your own lack of imagination, not the system's robustness and fault tolerance.

