Sunday, April 9, 2006

Backhoe 3, Internet 0

Around 17:30 local time on Thursday, April 6th, the 2 Mbit/sec leased line which provides Fourmilab's main Internet connectivity went down. The router detected the failure and, as it is configured to do, established a dial-up ISDN connection to my ISP to serve as a backup. This connection is limited to 128 Kbit/sec, and while it's fine for E-mail, DNS, and local access to Web sites, it is hopelessly inadequate for the outbound Web traffic, so as seen from the outside the site, while never actually down, appeared to run so slowly you could barely tell the difference. I subscribe to a service which makes HTTP requests to the site every 15 minutes from a variety of peering points in Europe and the U.S. and prepares a daily response time report. Usually, 90% of these polls complete in less than half a second, with the worst case on the order of three seconds. On “Black Thursday” the maximum response time was more than twenty seconds and the mean response was almost three seconds!
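For readers curious how a daily report of this kind boils a day's polls down to three figures, here is a minimal sketch in Python. It is not the monitoring service's actual method, which I have no visibility into: the function name `summarize_polls` and the choice of the nearest-rank definition of a percentile are assumptions made purely for illustration.

```python
from statistics import mean


def summarize_polls(response_times_s):
    """Summarize one day's poll response times (in seconds).

    Returns a tuple (p90, worst, average). The name and the nearest-rank
    percentile definition are illustrative assumptions, not the
    monitoring service's documented behaviour.
    """
    if not response_times_s:
        raise ValueError("need at least one poll result")
    ordered = sorted(response_times_s)
    # Nearest-rank 90th percentile: the smallest sample such that at
    # least 90% of all samples are less than or equal to it.
    # Integer arithmetic computes ceil(0.9 * n) without rounding surprises.
    rank = -(-9 * len(ordered) // 10)
    p90 = ordered[rank - 1]
    return p90, ordered[-1], mean(ordered)
```

With a poll every 15 minutes from each monitoring point, a single point contributes 96 samples a day, so a handful of very slow responses is enough to drag the mean up sharply even when most polls still look normal.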


Here is how the outage looked from the load balancer. Instead of the usual mean HTTP request rate between 6 and 8 packets per second, there's a “notch” for the duration of the leased line outage where only about two packets per second were processed. (The earlier down-spike that day is due to my restarting the Apache HTTP daemon to “cycle” the Web access log to a new file; that causes only about a 10-second interruption in service, but the load balancer spots it and marks it on the chart. An outage longer than 15 seconds will cause the load balancer to start directing requests to the hot spare server, but a log cycle isn't long enough to trigger the fail-over.)
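The fail-over rule in that parenthetical amounts to a simple threshold test, which can be sketched as follows. This is only an illustration of the rule as described, not the load balancer's real configuration or code: the function `classify_outage`, its return labels, and the treatment of an outage of exactly 15 seconds as not yet triggering fail-over are all assumptions.

```python
# From the text: an outage longer than this sends traffic to the hot spare.
FAILOVER_THRESHOLD_S = 15.0


def classify_outage(duration_s, threshold_s=FAILOVER_THRESHOLD_S):
    """Classify an interruption of the primary server by its duration.

    Hypothetical helper: returns "up" when there was no interruption,
    "marked" when the outage is short enough only to be noted on the
    chart, and "failover" when it is long enough to redirect requests
    to the hot spare server.
    """
    if duration_s <= 0:
        return "up"
    if duration_s > threshold_s:
        return "failover"
    return "marked"
```

Under these assumptions a log cycle of about 10 seconds comes back as "marked", which matches the down-spike on the chart with no fail-over.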

The leased line outage was due, it turns out, to two Swisscom 20 Gbit/sec fibre optic backbone cables being cut, which darkened the Internet connectivity of some 30,000 clients, including Fourmilab. (Swisscom is not my ISP, but the line which connects Fourmilab to the ISP's point of presence in Bern is supplied by Swisscom.) Notwithstanding the newspaper story (click the thumbnail for an enlargement), which quotes Swisscom as saying the outage was three hours in duration, in fact connectivity was not restored until about midnight, as you can see from the hit rate chart. This was confirmed by the service desk at my ISP, who said that all their customers with connectivity through the affected cables came back up at the same time.

Connoisseurs of technological trouble know that events like this, like quarks and macroscopic spatial dimensions, always come in threes, and won't be surprised to read that at almost the same moment the two fibre cables serving Fourmilab were cut, a third, completely unrelated, cable was cut about five kilometres on the other side of Fourmilab. Swisscom attributed the outage reports from customers served by this cable to the two cuts they were already working on, and only finally twigged to the actual problem when fixing the first two cables didn't make it go away; some people were cut off from land-line telephone service for as long as 30 hours.

When my original ISP went bankrupt in the aftermath of the dot.bomb, I considered buying half of my bandwidth from each of two different ISPs, with a router configured to load balance as long as both were up. As it turns out, both of the substantial connectivity outages I've experienced in the last 12 years were due to cables cut by excavations, and either cut would, in all probability, have taken down both ISP connections. As they said about the ascent engine on the lunar module, “some things just have to work”.

Posted at April 9, 2006 22:07