Saturday, April 9, 2005
Valley of the Dells: RAID Firmware Upgrade
Ever since the Fourmilab server farm was put into production, the only serious problem has been random outages in which all subsequent writes to a server's filesystems fail. This seems to be provoked by heavy load which generates large volumes of I/O--the first time I encountered it was before the server went into production (and hence was essentially idle), when I tried to copy a 4 GB archive from one filesystem to another. In production, these crashes occur at random intervals anywhere from a few days to more than a week apart. It is amazing how well a Linux system will run when all filesystem writes return I/O errors! In fact, it runs sufficiently well to fool the load balancer into thinking the system is up, but not well enough to actually service HTTP, FTP, and SMTP traffic.

I added a custom "server agent" which the load balancer probes; it not only returns an exponentially smoothed moving average of the server load, but also verifies that all server-critical filesystems are writable and reports the system down if one or more isn't (a sketch of such an agent appears after the log excerpt below). This causes a server in the "can't write to disc" failure mode to be removed from the available pool and reported down, but of course it doesn't remedy the actual problem, which only seems to be cured by a reboot (a messy one, since filesystems can't be cleanly unmounted if they can't be written).

When you're running with ext3 filesystems, the symptom of this failure is "journal write errors". This originally led me to suspect the problem lay in the ext3 filesystem itself, as I noted here and here. After I converted back to ext2, in a fine piece of misdirection, the problem didn't manifest itself for a full ten days, then struck again--twice within eight hours. Since recovery from the error truncates the log file containing the messages reporting it, and the original messages have usually scrolled off the console window, I had never actually seen the onset of the problem until it happened right in front of my eyes yesterday. Here's what it looks like:
12:37:31: megaraid: aborting-227170 cmd=28 <c=1 t=0 l=0>
12:37:31: megaraid abort: 227170:57[255:0], fw owner
12:37:31: megaraid: aborting-227171 cmd=2a <c=1 t=0 l=0>
12:37:31: megaraid abort: 227171:62[255:0], fw owner
12:37:31: megaraid: aborting-227172 cmd=2a <c=1 t=0 l=0>
12:40:31: megaraid abort: 227172[255:0], driver owner
12:40:32: megaraid: aborting-227173 cmd=28 <c=1 t=0 l=0>
12:40:32: megaraid abort: 227173[255:0], driver owner
12:40:32: megaraid: reseting the host...
12:40:32: megaraid: 2 outstanding commands. Max wait 180 sec
12:40:32: megaraid mbox: Wait for 2 commands to complete:180
12:40:32: megaraid mbox: Wait for 2 commands to complete:175
. . . counts down to zero . . .
12:40:32: megaraid mbox: Wait for 2 commands to complete:5
12:40:32: megaraid mbox: Wait for 2 commands to complete:0
12:40:32: megaraid mbox: critical hardware error!
12:40:32: megaraid: reseting the host...
12:40:32: megaraid: hw error, cannot reset
12:40:32: megaraid: reseting the host...
12:40:32: megaraid: hw error, cannot reset
12:40:32: scsi: Device offlined - not ready after error
recovery: host 0 channel 1 id 0 lun 0
12:40:32: scsi: Device offlined - not ready after error
recovery: host 0 channel 1 id 0 lun 0
12:40:32: SCSI error : <0 1 0 0> return code = 0x6000000
12:40:32: end_request: I/O error, dev sda, sector 57443971
12:40:32: SCSI error : <0 1 0 0> return code = 0x6000000
12:40:32: end_request: I/O error, dev sda, sector 270776074
12:40:32: scsi0 (0:0): rejecting I/O to offline device
12:40:32: end_request: I/O error, dev sda, sector 130580362
12:40:32: scsi0 (0:0): rejecting I/O to offline device
12:40:32: SCSI error : <0 1 0 0> return code = 0x6000000
12:40:32: end_request: I/O error, dev sda, sector 57224203
12:40:32: Buffer I/O error on device sda6, logical block 622595
12:40:32: lost page write due to I/O error on sda6
12:40:32 server0 kernel: scsi0 (0:0): rejecting I/O to offline device
12:40:32 server0 last message repeated 14 times
12:40:32: IO error syncing ext2 inode [sda6:0004a3a2]
I have elided the date, "Apr 8", and the identification "server0:kernel" from these messages to make them fit on the page, and wrapped a couple of long ones. Note that this entire train wreck occurs within the space of about one second. It continues in this vein until you reboot. Once in this state, none of the clean shutdown alternatives work--I have to force a reboot from the Remote Access Controller or else power cycle the server.
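For what it's worth, here is a minimal sketch of what such a "server agent" check might look like. This is not the actual Fourmilab agent (the mount points, smoothing factor, probe port, and reply format are all invented for illustration), but it shows the two ingredients: an exponentially smoothed moving average of the load, and a test write to every critical filesystem before the server is reported healthy.

#!/usr/bin/env python
# Hypothetical sketch of a load-balancer "server agent" along the lines
# described above.  All names below (CRITICAL_MOUNTS, ALPHA, PROBE_PORT,
# the "UP"/"DOWN" reply format) are illustrative assumptions, not the
# actual Fourmilab configuration.

import os
import socket
import tempfile

CRITICAL_MOUNTS = ["/", "/var", "/home"]    # filesystems the server needs
ALPHA = 0.2                                 # smoothing factor for the load average
PROBE_PORT = 9200                           # port the load balancer polls

smoothed_load = 0.0

def filesystems_writable():
    """Return True only if every critical mount accepts a small test write."""
    for mount in CRITICAL_MOUNTS:
        try:
            fd, path = tempfile.mkstemp(dir=mount)
            os.write(fd, b"probe")
            os.close(fd)
            os.unlink(path)
        except OSError:
            return False
    return True

def update_load():
    """Exponentially smoothed moving average of the 1-minute load average."""
    global smoothed_load
    load1 = os.getloadavg()[0]
    smoothed_load = ALPHA * load1 + (1.0 - ALPHA) * smoothed_load
    return smoothed_load

def serve():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", PROBE_PORT))
    srv.listen(5)
    while True:
        conn, _ = srv.accept()
        if filesystems_writable():
            reply = "UP %.2f\n" % update_load()
        else:
            reply = "DOWN\n"    # tells the balancer to pull this server from the pool
        conn.sendall(reply.encode())
        conn.close()

if __name__ == "__main__":
    serve()

The load balancer then only has to poll the port and treat anything other than an "UP" reply (or no reply at all) as a down server, which is what takes a machine in the "can't write to disc" state out of the pool.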
This log information heightened a suspicion already expressed by other Dell system administrators on various Dell/Linux discussion fora: that what we are dealing with is an inherent flaw in either the Dell PERC 4e/Si (PowerEdge RAID Controller) or the SCSI adaptor to which the discs it manages are attached. Since numerous other users of the Dell PowerEdge 1850 (and its taller sibling, the 2850) have reported these I/O hangs under intense load on a variety of operating systems, I was pretty confident the problem was generic to the hardware and not something particular to my machine or configuration.
Searching further based on the log entries, I came upon an announcement of a firmware update to the RAID controller, listed as "Criticality: Urgent" and described as follows:
Increased memory timing margins to address the following potential symptoms: System hangs during operating system installs or other heavy I/O, Windows Blue Screens referencing "Kernel Data Inpage Errors" or Linux Megaraid timeouts/errors.

You could hardly ask for a clearer description of the symptoms than that! I downloaded the new firmware and, after some twiddling to permit the upgrade to be installed from the RAM filesystem of the Fedora Core 3 Rescue CD, re-flashed the RAID controller in the backup "pathfinder" server with no problems. After it had run for about 10 hours without incident, I installed the new firmware on the front-line server as well. Will this finally fix the I/O hang under heavy load? We'll see . . . . If all goes well for a week or so, I'll convert the non-root filesystems back to ext3, which I consider exonerated (although perhaps a precipitating cause, due to the increased I/O traffic it generates), but three decades of system administration counsel against changing more than one thing at a time.
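For reference, converting an ext2 filesystem back to ext3 amounts to adding a journal with tune2fs -j and then mounting it as ext3. A minimal sketch of that step follows; the device names are placeholders for illustration, not the actual layout of these servers.

#!/usr/bin/env python
# Sketch of the planned ext2 -> ext3 conversion: "tune2fs -j" adds a journal
# to an existing ext2 filesystem, turning it into ext3.  The device names are
# placeholders; it is safest to create the journal with the filesystem
# unmounted, and /etc/fstab must be updated to mount it as ext3 afterwards.

import subprocess

NON_ROOT_FILESYSTEMS = ["/dev/sda5", "/dev/sda6"]   # illustrative only

for device in NON_ROOT_FILESYSTEMS:
    subprocess.check_call(["tune2fs", "-j", device])
    print("Added journal to %s; change its fstab entry to ext3" % device)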
Posted at April 9, 2005 09:56