Heisenbug

by John Walker


xkcd: Location sharing
Cartoon by xkcd.com used under Creative Commons Attribution-NonCommercial 2.5 License.

The year was 1974. Computers looked like computers, just as you'd expect from the movies. Below is the maintenance panel of a Univac 1110 computer, like the one I worked on at the time. Operators interacted with the computer through a separate console, with a CRT terminal and page printer, but the maintenance panel provided access to the low-level circuitry of the computer, allowing hardware maintenance engineers and systems programmers (like me, who worked on the operating system at a low level) to examine its operation in binary, stepping the multi-million-dollar machine instruction by instruction through programs late at night while paying customers were asleep.

Univac 1110 maintenance panel

The indicators at the right showed the most basic registers of the computer such as the program counter and processor state register, while the knobs at the left allowed selecting a variety of internal registers to be displayed or modified, with a legend above the binary display mechanically rotated to label the bits in the selection. This was a 36-bit word machine, so the registers were of that length.

When the computer was running its normal work load, the maintenance panel, while in view from the operator's console, displayed what was essentially a random pattern of lights: the registers were changing so rapidly they were averaged out into a blur.

Univac 1110 system

That wouldn't do. I was a systems programmer! This machine was what we now call a symmetric multiprocessor: it had two central processing units (CPUs) which shared access to common memory. In single processor systems, it was usual that when the system was idle it would simply while away the time in an infinite loop until interrupted when work arrived, but this was a poor choice for a multiprocessor: retrieving the infinite loop instruction over and over from memory would impede other processors and input/output devices from accessing it. (Today's computers would keep such an instruction in a CPU cache local to the processor, but this was the 1970s, when such extravagances were like flying cars.)

Fortunately, the Univac 1110 had an instruction called “Block Transfer”, intended to move memory in bulk from one location to another with a single instruction. Further, this instruction could be (ab)used to move data from CPU registers to other registers, never accessing memory at all. By replacing the (“here: go to here;”) instruction in the idle loop with a block transfer, an idle CPU would make no memory accesses at all, and thus not interfere with other CPUs or input/output operations.

This was cool, and I considered myself a Knight of Efficiency for implementing it. But then I observed something else. While the block transfer instruction was executing, the value in the register it was transferring was displayed in one of the rows of lights on the maintenance panel, so long as the rotary switch was set to view it. Now this suggested something even cooler: since I could display any pattern I wished in the lights, why not do something interesting like bits which went zorp-zorp back and forth across the field? This would only be displayed when the CPU was idle, and the speed at which the bits moved when visible indicated the percentage of idle time, and hence the inverse of the load on the system. I called it “The Speedometer”. It was an immediate hit, and other Univac sites adopted it.
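For readers who want to picture the zorp-zorp pattern, here is a minimal sketch in Python. It is not the original code, which ran on the Univac 1110 and relied on the Block Transfer instruction. It only illustrates one plausible way to generate a single lit bit bouncing across a 36-bit word. The names speedometer_frames and render are invented for this illustration, and the choice of a single bit is an assumption: the actual pattern may have used more.

```python
from itertools import islice

WORD_BITS = 36  # the Univac 1110 was a 36-bit word machine


def speedometer_frames(width=WORD_BITS):
    """Yield register values with one lit bit bouncing end to end, forever.

    Hypothetical illustration only: the real speedometer was part of the
    operating system's idle loop, not a Python generator.
    """
    position, step = 0, 1
    while True:
        yield 1 << position
        # Reverse direction when the next move would leave the word.
        if not 0 <= position + step < width:
            step = -step
        position += step


def render(value, width=WORD_BITS):
    """Draw a register value as a row of lights, most significant bit first."""
    return "".join(
        "*" if (value >> bit) & 1 else "." for bit in reversed(range(width))
    )


if __name__ == "__main__":
    # Show a few frames on a narrow 8-bit "panel" to keep the output short.
    for frame in islice(speedometer_frames(8), 16):
        print(render(frame, 8))
```

On the real machine the pattern advanced only while the CPU sat in its idle loop, which is why the apparent speed of the lights tracked idle time. This sketch has no notion of load and simply produces the sequence of values.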

But all was not well at its home site. After installation of the speedometer, the system started to crash more frequently than it had previously. (Although, at that time, reliability was such that it was difficult to tell the difference.) One property of these grand 1970s timesharing systems which has been forgotten by users of personal computers is that when the music stopped—the system went down—you could have more than three thousand people furious with you all at the same time. This was not a good place to be, especially when your cool hack for the blinky lights seemed to be culpable.

You could, and I did, look at the speedometer code in great detail, and find there was nothing which could explain the crashes. Further, none of the other sites which had installed the speedometer were experiencing these crashes.

What was going on?

It turns out this was a Heisenbug, a problem which manifested itself depending upon whether it was being observed. The name is a play on Heisenberg's uncertainty principle in quantum mechanics, which is popularly associated with the idea that the act of observing a system changes what is observed. In this case, in order to observe the speedometer on the maintenance panel, the rotary switch on the left side of the panel had to be set to view an internal register in the CPU which held the value in the block transfer. In order to display the bits, the circuitry in the panel imposed a load upon this internal register which, because it contained a marginal component, caused it to randomly fail. When a different register was displayed (as maintenance personnel did when performing diagnostics on the CPU), the problem did not occur. Thus the problem was not caused by the speedometer, but rather by the selection of the display which allowed it to be observed. The weak circuit which failed when its state was monitored by the maintenance panel display was eventually identified and replaced. In electrical engineering, a circuit's behaving differently when observed with a measuring instrument is called a probe effect, and has been infuriating people for more than a century. My reputation was rehabilitated until the next outrage.

What can you learn from this? Observation affects what you measure. Always blame the systems programmer. Sometimes it really is the hardware. The 1970s were fully as awful as you've imagined them to be.


UNIVAC has been, over the years, a registered trademark of Eckert-Mauchly Computer Corporation, Remington Rand Corporation, Sperry Rand Corporation, Sperry Corporation, and Unisys Corporation.

August 10, 2016