Saturday, April 9, 2005

Server Farm Status

I've been writing so much about the server farm recently it's high time I showed a picture of its present state of development. (Click the image for an enlargement in a separate window.)

[Image: Fourmilab server farm, 2005-04-09]

The two boxes at the bottom are the Dell PowerEdge 1850 servers which host the site. Each has dual Intel Xeon 3.6 GHz hyper-threaded processors, which gives each server the equivalent of four CPUs. Each has 8 GB of ECC RAM, dual 146 GB 10,000 RPM SCSI drives on an embedded RAID controller, and two Gigabit Ethernet interfaces, which are "bonded" into a single logical interface, with each physical interface connected to one of the two 16-port Dell PowerConnect 2616 Gigabit Ethernet switches at the top of the rack. The interface-to-switch connections of the two servers are crossed with respect to one another. The two switches are connected together and normally forward packets to one another; each is connected to the DMZ port of one of the two redundant firewalls (which aren't in this rack, but in the communications rack upstairs).
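
The Linux bonding driver reports the state of each logical interface under /proc/net/bonding, which makes it easy to confirm that both physical links are up. Here, purely as an illustration, is a minimal Python sketch which parses that file and prints the link status of each slave interface; the bond name "bond0" is an assumption about the local configuration.

#!/usr/bin/env python
# Sketch: report the MII status of each slave in a bonded interface by
# parsing /proc/net/bonding/bond0 (the bond name "bond0" is assumed).

def bond_status(bond="bond0"):
    slaves = {}
    current = None
    try:
        f = open("/proc/net/bonding/%s" % bond)
    except IOError:
        return None
    for line in f:
        line = line.strip()
        if line.startswith("Slave Interface:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("MII Status:") and current:
            slaves[current] = line.split(":", 1)[1].strip()
    f.close()
    return slaves

if __name__ == "__main__":
    status = bond_status()
    if status is None:
        print("bonding information not available")
    else:
        for iface, state in sorted(status.items()):
            print("%s: %s" % (iface, state))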

Between the servers and switches are two identical Coyote Point Equalizer 350 load balancers run in primary/backup high-availability mode. The top load balancer is connected to the top switch and the bottom load balancer to the bottom switch, so they exchange heartbeats through the interconnected switches. If one switch goes down, whichever load balancer is connected to the remaining switch becomes primary, and since each server has an interface connected to both switches, the surviving load balancer can still communicate with both servers.

The rack is 24 units high, and all the components are designed to permit dense packing, but I've spaced them out both because that makes them easier to work on and remove if necessary, and because additional breathing room can't hurt cooling. It isn't obvious from this picture, but the rack is deep--73.5 cm from front to back rails and just a tad less than one metre for the entire cabinet; the Dell servers are only 1U high, but they just keep on coming in the depth dimension.

The load balancers are less than 50 cm deep, so I've exploited the unused space by mounting two 15-socket outlet strips on the back rails, one behind each load balancer. These are plugged into independent APC Smart-UPS 1500 units which sit on the floor behind the rack, fed from separate dedicated 10 A 230 V circuits with slow-blow thermal fuses. (Never plug a UPS or any other equipment with a big iron-core transformer into a circuit with a fast-trip breaker. The inrush current after even a momentary power blip may pop the breaker and bring your holiday to an unexpected end. This has happened to me.) The servers have dual redundant power supplies, one plugged into each outlet strip, while the other components, which come in pairs, have one unit of each pair plugged into each strip. The UPS units are not mounted in the rack due to my earlier surreal adventure with a rack-mounted UPS. The monitoring and control port of each UPS is connected to a serial port on one of the two servers; there is presently no broadcast shutdown, but since each UPS can handle the load of both servers and each server can run on either of its two power supplies, this expedient gets the job done, albeit inelegantly. The servers run the Apcupsd monitoring and control software. The load balancers, which are really FreeBSD machines, would also like to be shut down cleanly before the power goes away, but at the moment that remains on my to-do list.
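
Since the servers already run Apcupsd, the state of each UPS can be read back with the apcaccess utility it provides. Here, for illustration only, is a minimal Python sketch which queries apcaccess and reports the UPS status, charge, and estimated runtime; the field names come from apcupsd's status output, and nothing else is specific to this installation.

#!/usr/bin/env python
# Sketch: poll apcupsd via "apcaccess status" and report whether the UPS
# is on line or on battery, its charge, and its estimated runtime.

import os

def ups_status():
    fields = {}
    for line in os.popen("apcaccess status"):
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields

if __name__ == "__main__":
    s = ups_status()
    print("Status: %s  Charge: %s  Runtime left: %s" %
          (s.get("STATUS", "?"), s.get("BCHARGE", "?"), s.get("TIMELEFT", "?")))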

Posted at 15:50 Permalink

Valley of the Dells: RAID Firmware Upgrade

Ever since the Fourmilab server farm went into production, the only serious problem has been random outages in which, from some moment onward, every write to a server's filesystems fails. This seems to be provoked by heavy load which generates large volumes of I/O--the first time I encountered it was before the server went into production (and hence was essentially idle), when I tried to copy a 4 GB archive from one filesystem to another. In production, these crashes occur at random intervals anywhere from a few days to more than a week apart.
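
The kind of load that first triggered the failure can be approximated with a crude stress test along these lines--a Python sketch, for illustration only, which repeatedly copies a large file from one filesystem to another and forces it out to disc; the paths and file size are assumptions, not the actual files involved.

#!/usr/bin/env python
# Sketch: generate bursts of heavy disc I/O by copying a large file
# between filesystems, forcing each copy out to the RAID controller.

import os, shutil, time

SRC = "/home/bigarchive.tar"        # assumed: an existing file of several GB
DST = "/var/tmp/bigarchive.copy"    # assumed: a path on a different filesystem

def stress(passes=20):
    for i in range(passes):
        start = time.time()
        shutil.copyfile(SRC, DST)   # large burst of read/write I/O
        os.system("sync")           # flush the copy out to disc
        os.remove(DST)
        print("pass %d: %.1f seconds" % (i + 1, time.time() - start))

if __name__ == "__main__":
    stress()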

It is amazing how well a Linux system will run when all filesystem writes return I/O errors! In fact, it runs well enough to fool the load balancer into thinking the system is up, but not well enough to actually service HTTP, FTP, and SMTP traffic. I added a custom "server agent", which the load balancer probes; it not only returns an exponentially smoothed moving average of the server load, but also verifies that all server-critical filesystems are writable and reports the system down if one or more isn't. This causes a server in the "can't write to disc" failure mode to be removed from the available pool and reported down, but of course it doesn't remedy the actual problem, which only seems to be cured by a reboot (a messy one, since filesystems can't be cleanly unmounted if they can't be written).
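
As an illustration of the idea, here is a minimal Python sketch of such an agent: it maintains an exponentially smoothed average of the one-minute load average and answers each probe with that value, or with a "down" indication if any critical filesystem refuses a test write. The port number, mount points, smoothing factor, and reply convention are all assumptions; the actual Equalizer server agent protocol is defined in Coyote Point's documentation.

#!/usr/bin/env python
# Sketch of a load balancer "server agent": reports smoothed load, or a
# "down" value if a server-critical filesystem no longer accepts writes.

import os, socket, threading, time

CRITICAL_FS = ["/", "/var", "/home"]   # assumed server-critical mount points
ALPHA = 0.3                            # assumed smoothing factor
PORT = 1510                            # assumed agent port

smoothed = [0.0]

def sampler():
    # Update the exponentially smoothed 1-minute load average every 5 seconds.
    while True:
        load1 = os.getloadavg()[0]
        smoothed[0] = ALPHA * load1 + (1.0 - ALPHA) * smoothed[0]
        time.sleep(5)

def filesystems_writable():
    # Create, write, flush, fsync, and delete a probe file on each filesystem.
    for fs in CRITICAL_FS:
        probe = os.path.join(fs, ".agent_write_test")
        try:
            f = open(probe, "w")
            f.write("ok\n")
            f.flush()
            os.fsync(f.fileno())
            f.close()
            os.remove(probe)
        except (IOError, OSError):
            return False
    return True

def serve():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", PORT))
    srv.listen(5)
    while True:
        conn, addr = srv.accept()
        if filesystems_writable():
            reply = "%d\n" % int(smoothed[0] * 100)
        else:
            reply = "-1\n"             # signal "server down" to the balancer
        conn.sendall(reply.encode("ascii"))
        conn.close()

if __name__ == "__main__":
    t = threading.Thread(target=sampler)
    t.daemon = True
    t.start()
    serve()

The probe write is flushed and fsynced so that a dead controller can't hide behind buffered writes that only fail later.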

When you're running with ext3 filesystems, the symptom of this failure is "journal write errors". This originally caused me to suspect the problem lay in the ext3 filesystem itself, as I noted here and here. After converting back to ext2, in a fine piece of misdirection, the problem didn't manifest itself for a full ten days, then it struck again--twice within eight hours. Since recovery from the error truncates the log file containing the messages reporting it, and usually the original messages have scrolled off the console window, I'd not actually seen the onset of the problem until it happened right in front of my eyes yesterday. Here's what it looks like:

12:37:31: megaraid: aborting-227170 cmd=28 <c=1 t=0 l=0>
12:37:31: megaraid abort: 227170:57[255:0], fw owner
12:37:31: megaraid: aborting-227171 cmd=2a <c=1 t=0 l=0>
12:37:31: megaraid abort: 227171:62[255:0], fw owner
12:37:31: megaraid: aborting-227172 cmd=2a <c=1 t=0 l=0>
12:40:31: megaraid abort: 227172[255:0], driver owner
12:40:32: megaraid: aborting-227173 cmd=28 <c=1 t=0 l=0>
12:40:32: megaraid abort: 227173[255:0], driver owner
12:40:32: megaraid: reseting the host...
12:40:32: megaraid: 2 outstanding commands. Max wait 180 sec
12:40:32: megaraid mbox: Wait for 2 commands to complete:180
12:40:32: megaraid mbox: Wait for 2 commands to complete:175
       . . . counts down to zero . . . 
12:40:32: megaraid mbox: Wait for 2 commands to complete:5
12:40:32: megaraid mbox: Wait for 2 commands to complete:0
12:40:32: megaraid mbox: critical hardware error!
12:40:32: megaraid: reseting the host...
12:40:32: megaraid: hw error, cannot reset
12:40:32: megaraid: reseting the host...
12:40:32: megaraid: hw error, cannot reset
12:40:32: scsi: Device offlined - not ready after error
          recovery: host 0 channel 1 id 0 lun 0
12:40:32: scsi: Device offlined - not ready after error
          recovery: host 0 channel 1 id 0 lun 0
12:40:32: SCSI error : <0 1 0 0> return code = 0x6000000
12:40:32: end_request: I/O error, dev sda, sector 57443971
12:40:32: SCSI error : <0 1 0 0> return code = 0x6000000
12:40:32: end_request: I/O error, dev sda, sector 270776074
12:40:32: scsi0 (0:0): rejecting I/O to offline device
12:40:32: end_request: I/O error, dev sda, sector 130580362
12:40:32: scsi0 (0:0): rejecting I/O to offline device
12:40:32: SCSI error : <0 1 0 0> return code = 0x6000000
12:40:32: end_request: I/O error, dev sda, sector 57224203
12:40:32: Buffer I/O error on device sda6, logical block 622595
12:40:32: lost page write due to I/O error on sda6
12:40:32 server0 kernel: scsi0 (0:0): rejecting I/O to offline device
12:40:32 server0 last message repeated 14 times
12:40:32: IO error syncing ext2 inode [sda6:0004a3a2]

I have elided the date, "Apr 8", and the identification "server0:kernel" from these messages to make them fit on the page, and wrapped a couple of long ones. Note that once the controller gives up and resets, the entire train wreck occurs within the space of about one second. It continues in this vein until you reboot. Once in this state, none of the clean shutdown alternatives work--I have to force a reboot from the Remote Access Controller or else power cycle the server.
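
One way to avoid losing the onset next time is to watch the kernel log as it is written and ship any megaraid or I/O error lines somewhere the failure can't erase them--for instance, to the other server. Here is a Python sketch of such a watcher; the log path, peer address and port, and the trigger strings (taken from the messages above) are all assumptions about the local setup.

#!/usr/bin/env python
# Sketch: follow the kernel log and forward megaraid abort / I/O error
# lines to another host the moment they appear, so the record of the
# failure's onset survives even if the local log is later truncated.

import socket, time

LOG = "/var/log/messages"
PEER = ("server1.example.com", 5140)   # hypothetical collector on the other server
TRIGGERS = ("megaraid: aborting", "megaraid abort:",
            "critical hardware error", "I/O error, dev sda")

def follow(path):
    # Yield new lines as they are appended to the log, starting at the end.
    f = open(path)
    f.seek(0, 2)
    while True:
        line = f.readline()
        if not line:
            time.sleep(1)
            continue
        yield line

if __name__ == "__main__":
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for line in follow(LOG):
        if any(t in line for t in TRIGGERS):
            sock.sendto(line.encode("ascii", "replace"), PEER)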

This log information heightened my suspicion (already expressed by other Dell system administrators on various Dell/Linux discussion fora) that what we are dealing with is an inherent flaw in either the Dell PERC 4e/Si (PowerEdge RAID Controller) or the SCSI adaptor to which the discs it manages are attached. Since numerous other users of the Dell PowerEdge 1850 (and its taller sibling, the 2850) have reported these I/O hangs under intense load on a variety of operating systems, I was pretty confident the problem was generic to the hardware and not something particular to my machine or configuration.

Searching further based on the log entries, I came upon an announcement of a firmware update to the RAID controller, listed as "Criticality: Urgent" and described as follows:

Increased memory timing margins to address the following potential symptoms: System hangs during operating system installs or other heavy I/O, Windows Blue Screens referencing "Kernel Data Inpage Errors" or Linux Megaraid timeouts/errors.

You could hardly ask for a clearer description of the symptoms than that! I downloaded the new firmware and, after some twiddling to permit the upgrade to be installed from the RAM filesystem of the Fedora Core 3 Rescue CD, re-flashed the RAID controller in the backup "pathfinder" server with no problems. After it had run for about 10 hours without incident, I installed the new firmware on the front-line server as well. Will this finally fix the I/O hang under heavy load? We'll see . . . . If all goes well for a week or so, I'll convert the non-root filesystems back to ext3, which I consider exonerated (although its additional journal traffic may have been a precipitating factor), but three decades of system administration counsel against changing more than one thing at a time.

Posted at 09:56 Permalink