Wednesday, June 29, 2005

Single Point Failures

Having spent more than six months transitioning Fourmilab over to a server farm architecture and, in the process, trying to eliminate as many single points of failure as possible, I'm even more than usually sensitive to unexpected single point failures in the real world. Last week, Switzerland experienced a doozy. If there's one thing you expect about Switzerland, it's that the trains run on time and, in fact, they almost always do. Last Wednesday, June 22nd, however, Switzerland experienced a nation-wide railroad failure which, for about three hours, brought every train in the country to a standstill. Now many people, including me, never even remotely imagined that such a thing could happen, but it did. Apparently, the combination of a lightning strike, two parallel circuits being taken offline for maintenance, and the general star architecture of the 15,000 volt main lines contributed to an outage which never could have happened in the age of steam. Here is an article in English with an early look at the circumstances--a comprehensive investigation is underway. Naturally, when this happened, I had a visitor from the other side of the country who, like the 200,000 people en route when the trains ground to a halt, was stranded until the system came back up. Fortunately, he was able to catch the last train that night after power was restored. The very next day, a shorter outage brought the trains to a halt in this region, but not all across the country.

Violent thunderstorms are an almost daily occurrence this time of year, and their impact on electrical and telecommunication infrastructure is something one simply must ride out, but it's kind of startling that a single event can bring all the trains to a halt. It's as unimaginable as Swissair going bankrupt . . . oh, right. Anyway, last Monday, June 27th, another lightning strike took down all the electricity in the Canton of Neuchâtel. A strike on a 125,000 volt line (bottom headline) turned out the lights for the entire Canton, including Fourmilab. Power was restored to some regions within about 40 minutes, but the outage here was more than an hour. The Fourmilab Web site was degraded, but did not go down--while the main servers shut down, the "pathfinder" server, which is a re-purposed laptop which can run about three hours on its batteries, took over as a hot spare, and the network gear and load balancers, which can run for hours on their UPSes, continued to route requests to that machine. Although, being a 1 GHz single processor machine, it was hideously overloaded and response was very slow, it did continue to run and serviced DNS and mail requests normally until power was restored. I didn't intend the site to be able to ride out an hour-long power outage, but it's nice to know it can.
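For readers who want to see the hot spare idea in concrete form, here is a minimal sketch in Python of the selection rule a load balancer might apply: send traffic to healthy primaries, and fall back to the spare only when no primary is left. This is an illustration under assumptions, not the actual Fourmilab configuration. The `Backend` class, the `pick_backends` function, and the names `server0` and `server1` are hypothetical; only "pathfinder" comes from the account above.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    healthy: bool
    is_spare: bool = False  # True for a hot spare such as "pathfinder"


def pick_backends(backends):
    """Return the backends that should receive traffic.

    Healthy primaries are preferred. The hot spare is used only
    when no primary is healthy.
    """
    primaries = [b for b in backends if b.healthy and not b.is_spare]
    if primaries:
        return primaries
    return [b for b in backends if b.healthy and b.is_spare]


# Hypothetical farm: two mains-powered primaries plus the battery-backed spare.
farm = [
    Backend("server0", healthy=True),
    Backend("server1", healthy=True),
    Backend("pathfinder", healthy=True, is_spare=True),
]

print([b.name for b in pick_backends(farm)])  # normal operation

# Simulate the power cut: the primaries shut down, the spare keeps running.
for b in farm:
    if not b.is_spare:
        b.healthy = False

print([b.name for b in pick_backends(farm)])  # degraded operation
```

The design choice worth noting is that the spare receives no traffic in normal operation, so its limited capacity only matters during an outage, which matches the "degraded, but did not go down" behaviour described above.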

I suppose the lesson from all this for practical engineering is that if you can't imagine a single point failure in a system, this may be due to your own lack of imagination, not the system's robustness and fault tolerance.
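One way to compensate for a lack of imagination is to let a program do the imagining. The sketch below, in Python, removes each component of a small supply graph in turn and checks whether the load can still be reached from the source; any component whose removal cuts the path is a single point of failure. The graphs and the names `source`, `hub`, `feed_a`, `feed_b`, and `load` are hypothetical illustrations, not the actual topology of the Swiss railway power network or of Fourmilab.

```python
from collections import deque


def reachable(graph, start, goal, removed=None):
    """Breadth-first search from start to goal, ignoring the removed node."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(graph, start, goal):
    """List every intermediate node whose loss disconnects goal from start."""
    nodes = set(graph) | {n for nbrs in graph.values() for n in nbrs}
    return sorted(
        n for n in nodes - {start, goal}
        if not reachable(graph, start, goal, removed=n)
    )


# Star layout: everything passes through one hub.
star = {"source": ["hub"], "hub": ["load"]}

# Redundant layout: two parallel feeds.
redundant = {"source": ["feed_a", "feed_b"], "feed_a": ["load"], "feed_b": ["load"]}

# The same layout with one feed taken offline for maintenance.
maintenance = {"source": ["feed_a"], "feed_a": ["load"]}

print(single_points_of_failure(star, "source", "load"))
print(single_points_of_failure(redundant, "source", "load"))
print(single_points_of_failure(maintenance, "source", "load"))
```

The third case is the instructive one: a system designed with redundancy can quietly acquire a single point of failure whenever part of it is out of service.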

Posted at 22:23 Permalink

Reading List: Round Ireland with a Fridge

Hawks, Tony. Round Ireland with a Fridge. London: Ebury Press, 1998. ISBN 0-09-186777-0.
The author describes himself as "not, by nature" either a drinking or a betting man. Ireland, however, can have a way of changing those particular aspects of one's nature, and so it was that after a night about which little else was recalled, our hero found himself having made a hundred pound bet that he could hitch-hike entirely around the Republic of Ireland in one calendar month, accompanied the entire way by a refrigerator. A man, at a certain stage in his life, needs a goal, even if it is, as this epic quest was described by an Irish radio host, "A totally purposeless idea, but a damn fine one." And the result is this very funny book. Think about it; almost every fridge lives a life circumscribed by a corner of a kitchen--door opens--light goes on--door closes--light goes out (except when the vegetables are having one of their wild parties in the crisper--sssshhh--mustn't let the homeowner catch on). How singular and rare it is for a fridge to experience the freedom of the open road, to go surfing in the Atlantic (chapter 10), to be baptised with a Gaelic name that means "freedom", blessed by a Benedictine nun (chapter 14), be guest of honour at perhaps the first-ever fridge party at an Irish pub (chapter 21), and make a triumphal entry into Dublin amid an army of well-wishers consisting entirely of the author pulling it on a trolley, a radio reporter carrying a mop and an ice cube tray, and an elderly bagpiper (chapter 23). Tony Hawks points out one disadvantage of his profession I'd never thought of before. When one of those bizarre things with which his life and mine are filled comes to pass, and you're trying to explain something like, "No, you see there were squirrels loose in the passenger cabin of the 747", and you're asked the inevitable, "What are you, a comedian?", he has to answer, "Well, actually, as a matter of fact, I am."

A U.S. edition is now available.

Posted at 21:11 Permalink