Wednesday, November 9, 2005

Hello, Dali II: UPS Meltdown Caught in the Act

Long-term readers of this chronicle will recall the posting in December 2004 about a set of APC UPS batteries which seemingly melted in a way reminiscent of Salvador Dali's painting The Persistence of Memory. This event has remained a mystery in the succeeding months, just one of those things I file away among the events which clutter my life and which, when I describe them to others, elicit the comment, "I've never heard of anything like that before" (if I had a centime for every time I'd heard that, I'd never have had to start a software company!).

As described in the original posting, I replaced the batteries in the UPS, gingerly recycled the partially melted ones, and ever since then that UPS has behaved OK. Then, last week, it happened again, to a different UPS, and this time I caught it in the act! All of the UPS units at Fourmilab are configured to perform a weekly self-test on Monday morning. When I happened to walk by the room in which this UPS was located, I immediately noted an acrid smell (I learned engineering back when the "educated nose" was an important asset for an electrical engineer) and, opening the door and walking in, a powerful stench, the sound of a fan blasting away, and abundant "excess heat". It was obvious the UPS was the source (if only because nothing else in that room had a fan of that size), and, holding my hand near the side of the cabinet, it was clear I'd be wiser not to touch it. I bypassed the UPS, shut it down to let it cool, and unplugged it from the mains.

After things had cooled down and the stench had been dissipated by airing out the room to the great outdoors, I opened up the UPS and discovered that, as before, the battery was jammed in the battery compartment. I proceeded to dismantle the UPS so as to extricate the battery and found that what jammed it was that it had begun to swell, both along the cells and in the front and back. The swelling was not as extreme as in the first incident (in particular, the top of the battery had not begun to bulge), but it was unambiguous, and almost certainly would have proceeded to a more extreme condition had I not pulled the plug when I did. The swelling along the side of the battery is scarier than it appears in this photo; the dark grey colour of the battery makes it difficult to show the deformation of the case, which clearly follows the cells and plates within them. The battery which manifested these symptoms had been installed on 2003-08-05, and thus was not at all long in the tooth by UPS battery standards.

Absent evidence, I shall speculate. What I think is happening is that there's some kind of failure mode which causes the battery to short during a recharge cycle in a way which causes the UPS to enter a continuous high-current recharge mode. This, applied to a fully charged battery, generates heat which causes the case to melt and bulge and provokes the evolution of acidic vapours through escape vents. This process is limited only by some failure within the battery or charging circuit which causes it to be shut down.

As with the first UPS on which this happened, I have replaced the battery, let the new battery fully charge, and performed a series of self-tests which completed absolutely normally. This is consistent with my suspicion that the problem is fundamentally due to a failure in the battery, but I may be wrong. In any case, whatever the failure mode, it would seem an excellent idea to include a thermal cut-out on battery temperature, or perhaps a computed measure of integrated charging current which would shut down charging before enough energy had been delivered to the battery to provoke physical failure (release of acid, hydrogen emission, or deformation of the case which precludes replacement of the battery without dismantling the UPS). This UPS, like the rack-mount unit in which the first meltdown occurred, was one of the first installed at Fourmilab, dating from 1996. I know not whether there is some kind of back-end of the bathtub curve which is provoking these failures in older UPS units, and/or whether newer units may have the kinds of protection mentioned above against these distasteful and potentially dangerous failures.
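
For the curious, here is a minimal sketch in Python of the kind of protection suggested above: a watchdog which stops charging either when the battery gets too hot or when the integrated charging current exceeds a fixed budget. This is purely illustrative and not how any APC product works as far as I know; the class name ChargeGuard, its fields, and every threshold are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ChargeGuard:
    """Hypothetical charger watchdog; all limits are illustrative assumptions."""
    max_temp_c: float = 50.0      # assumed thermal cut-out threshold
    max_charge_ah: float = 10.0   # assumed charge budget for one recharge cycle
    delivered_ah: float = 0.0     # running integral of charging current
    tripped: bool = False
    reason: str = ""

    def sample(self, current_a: float, temp_c: float, dt_s: float) -> bool:
        """Record one measurement; return True if charging may continue."""
        if self.tripped:
            # Once tripped, stay off until a human intervenes.
            return False
        # Rectangle-rule integration of current over the sample interval,
        # converted from ampere-seconds to ampere-hours.
        self.delivered_ah += max(current_a, 0.0) * dt_s / 3600.0
        if temp_c >= self.max_temp_c:
            self.tripped, self.reason = True, "over-temperature"
        elif self.delivered_ah >= self.max_charge_ah:
            self.tripped, self.reason = True, "charge budget exceeded"
        return not self.tripped
```

The charger would call sample() at each measurement interval and cut the charging current as soon as it returns False. The guard latches once tripped, on the assumption that a battery which has overheated or absorbed far more charge than expected deserves inspection before anything tries to charge it again.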

