Sunday, November 3, 2019
Reading List: Sunburst and Luminary
- Eyles, Don. Sunburst and Luminary. Boston: Fort Point Press, 2018. ISBN 978-0-9863859-3-3.
In 1966, the author graduated from Boston University with a
bachelor's degree in mathematics. He had no immediate job
prospects or career plans. He thought he might be interested
in computer programming due to a love of solving puzzles, but he
had never programmed a computer. When asked, in one of numerous
job interviews, how he would go about writing a program to
alphabetise a list of names, he admitted he had no idea. One
day, walking home from yet another interview, he passed an
unimpressive brick building with a sign identifying it as
the “MIT Instrumentation Laboratory”. He'd heard
a little about the place and, on a lark, walked in and asked
if they were hiring. The receptionist handed him a long
application form, which he filled out, and he was then immediately
sent to interview with a personnel officer. Eyles was
amazed when the personnel man seemed bent on persuading
him to come to work at the Lab. After reference checking, he
was offered a choice of two jobs: one in the “analysis
group” (whatever that was), and another on the team
developing computer software for landing the Apollo Lunar
Module (LM) on the Moon. That sounded interesting, and the job
had another benefit attractive to a 21-year-old just
graduating from university: it came with deferment from the
military draft, which was going into high gear as U.S.
involvement in Vietnam deepened.
Near the start of the Apollo project, MIT's Instrumentation
Laboratory, led by the legendary “Doc”
Charles
Stark Draper, won a sole source contract to design and
program the guidance system for the
Apollo spacecraft, which came to be known as the
“Apollo
Primary Guidance, Navigation, and Control System”
(PGNCS, pronounced “pings”). Draper and his
laboratory had pioneered inertial guidance systems for aircraft,
guided missiles, and submarines, and had in-depth expertise in
all aspects of the challenging problem of enabling the Apollo
spacecraft to navigate from the Earth to the Moon, land on the
Moon, and return to the Earth without any assistance from
ground-based assets. In a normal mission, it was expected that
ground-based tracking and computers would assist those on board
the spacecraft, but in the interest of reliability and
redundancy it was required that completely autonomous
navigation would permit accomplishing the mission.
The Instrumentation Laboratory developed an integrated system
composed of an
inertial
measurement unit consisting of gyroscopes
and accelerometers that provided a stable reference from which the
spacecraft's orientation and velocity could be determined, an
optical telescope which allowed aligning the inertial platform
by taking sightings on fixed stars, and an
Apollo
Guidance Computer (AGC), a general purpose digital computer which
interfaced to the guidance system, thrusters and engines on
the spacecraft, the astronauts' flight controls, and mission
control, and was able to perform the complex calculations for
en route maneuvers and the unforgiving lunar landing process in
real time.
Every Apollo lunar landing mission carried two AGCs: one in the
Command Module and another in the Lunar Module. The computer
hardware, basic operating system, and navigation support
software were identical, but the mission software was customised
due to the different hardware and flight profiles of the Command
and Lunar Modules. (The commonality of the two computers proved
essential in getting the crew of Apollo 13 safely back to Earth
after an explosion in the Service Module cut power to the
Command Module and disabled its computer. The Lunar Module's AGC
was able to perform the critical navigation and guidance
operations to put the spacecraft back on course for an Earth
landing.)
By the time Don Eyles was hired in 1966, the hardware design of
the AGC was largely complete (although a revision, called Block II,
was underway which would increase memory capacity and add some
instructions which had been found desirable during the initial
software development process), and the low-level operating system,
support libraries (implementing such functionality as fixed-point
arithmetic and vector and matrix computations), and a
substantial part of the software for the Command Module had been
written. But the software for actually landing on the Moon,
which would run in the Lunar Module's AGC, was largely just a
concept in the minds of its designers. Turning this into
hard code would be the job of Don Eyles, who had never written
a line of code in his life, and his colleagues. They seemed
undaunted by the challenge: after all, nobody knew
how to land on the Moon, so whoever attempted the task would
have to make it up as they went along, and they had access, in
the Instrumentation Laboratory, to the world's most experienced
team in the area of inertial guidance.
Today's programmers may be amazed it was possible to get
anything at all done on a machine with the capabilities of the
Apollo Guidance Computer, much less fly to the Moon and land
there. The AGC had a total of 36,864 15-bit words of read-only
core
rope memory, in which every bit was hand-woven to the
specifications of the programmers. As read-only memory,
the contents were completely fixed: if a change was
required, the memory module in question (which was
“potted” in a plastic compound) had to be
discarded and a new one woven from scratch. There was
no way to make “software patches”.
Read-write storage was limited to 2048 15-bit words of
magnetic
core memory. The read-write memory was non-volatile: its
contents were preserved across power loss and restoration.
(Each memory word was actually 16 bits in length, but one bit
was used for parity checking to detect errors and not accessible
to the programmer.) Memory cycle time was 11.72 microseconds.
There was no external bulk storage of any kind (disc, tape, etc.):
everything had to be done with the read-only and read-write
memory built into the computer.
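To put those word counts in more familiar units, here is a small arithmetic sketch. The word counts and bit widths come from the text above; the constant and function names are illustrative only.

```python
WORD_BITS = 15          # data bits per word visible to the programmer
PARITY_BITS = 1         # extra parity bit per word, hidden from programs
ROPE_WORDS = 36_864     # read-only core rope memory
ERASABLE_WORDS = 2_048  # read-write magnetic core memory


def kib(words, bits_per_word):
    """Convert a word count to kibibytes (1 KiB = 8192 bits)."""
    return words * bits_per_word / 8192


rope_kib = kib(ROPE_WORDS, WORD_BITS)
erasable_kib = kib(ERASABLE_WORDS, WORD_BITS)
rope_kib_with_parity = kib(ROPE_WORDS, WORD_BITS + PARITY_BITS)
erasable_kib_with_parity = kib(ERASABLE_WORDS, WORD_BITS + PARITY_BITS)

print(f"rope:     {rope_kib} KiB usable, {rope_kib_with_parity} KiB with parity")
print(f"erasable: {erasable_kib} KiB usable, {erasable_kib_with_parity} KiB with parity")
```

In these units the whole program fits in about 67.5 KiB and all working storage in under 4 KiB.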
The AGC software was an example of “real-time
programming”, a discipline with which few contemporary
programmers are acquainted. As opposed to an “app”
which interacts with a user and whose only constraint on how
long it takes to respond to requests is the user's patience,
a real-time program has to meet inflexible constraints in
the real world set by the laws of physics, with failure
often resulting in disaster just as surely as hardware
malfunctions. For example, when the Lunar Module is descending
toward the lunar surface, burning its descent engine to brake
toward a smooth touchdown, the LM is perched atop
the thrust vector of the engine just like a pencil balanced
on the tip of your finger: it is inherently unstable, and
only constant corrections will keep it from tumbling over
and crashing into the surface, which would be bad. To prevent
this, the Lunar Module's AGC runs a piece of software called
the digital autopilot (DAP) which, every tenth of a second, issues
commands to steer the descent engine's nozzle to keep the Lunar
Module pointed flamy side down and adjusts the thrust to
maintain the desired descent velocity (the thrust must be
constantly adjusted because as propellant is burned, the mass of
the LM decreases, and less thrust is needed to maintain
the same rate of descent). The AGC/DAP absolutely must
compute these steering and throttle commands and send them to
the engine every tenth of a second. If it doesn't, the Lunar
Module will crash. That's what real-time computing is all about:
the computer has to deliver those results in real time, as the
clock ticks, and if it doesn't (for example, it decides to give
up and flash a Blue Screen of Death instead), then the consequences
are not an irritated or enraged user, but actual death in the real
world. Similarly, every two seconds the computer must
read the spacecraft's position from the inertial measurement
unit. If it fails to do so, it will hopelessly lose track of
which way it's pointed and how fast it is going. Real-time
programmers live under these demanding constraints and,
especially given the limitations of a computer such as the AGC,
must deploy all of their cleverness to meet them without fail,
whatever happens, including transient power failures,
flaky readings from instruments, user errors, and
completely unanticipated “unknown unknowns”.
The software which ran in the Lunar Module AGCs for Apollo
lunar landing missions was called LUMINARY, and in its final
form (version 210) used on Apollo 15, 16, and 17, consisted
of around 36,000 lines of code (a mix of assembly language
and interpretive code which implemented high-level operations),
of which Don Eyles wrote in excess of 2,200 lines, responsible
for the lunar landing from the start of braking from lunar
orbit through touchdown on the Moon. This was by far the most
dynamic phase of an Apollo mission, and the most demanding on
the limited resources of the AGC, which was pushed to around
90% of its capacity during the final landing phase where the
astronauts were selecting the landing spot and guiding the
Lunar Module toward a touchdown. The margin was razor-thin,
and that's assuming everything went as planned. But this was
not always the case.
It was when the unexpected happened that the genius of the AGC
software and its ability to make the most of the severely
limited resources at its disposal became apparent. As Apollo 11
approached the lunar surface, a series of five program alarms,
codes 1201 and 1202, interrupted the display of altitude and
vertical velocity being monitored by Buzz Aldrin and read off
to guide Neil Armstrong in flying to the landing spot. These
codes both indicated out-of-memory conditions in the AGC's
scarce read-write memory. The 1201 alarm was issued when
all five of the 44-word vector accumulator (VAC) areas were in use
when another program requested to use one, and 1202 signalled
exhaustion of the eight 12-word core sets required by
each running job. The computer had a single processor and
could execute only one task at a time, but its operating system
allowed lower priority tasks to be interrupted in order to
service higher priority ones, such as the time-critical autopilot
function and reading the inertial platform every two seconds.
Each suspended lower-priority job used up a core set and,
if it employed the interpretive mathematics library, a VAC,
so exhaustion of these resources usually meant the computer was
trying to do too many things at once. Task priorities
were assigned so the most critical functions would be completed
on time, but computer overload signalled something seriously
wrong—a condition in which it was impossible to guarantee
all essential work was getting done.
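The resource exhaustion behind the two alarm codes can be illustrated with a toy model. This is a minimal sketch, not the AGC executive: the pool sizes (eight core sets, five VAC areas) and the alarm codes come from the text, while the class, the method names, and the order in which the two pools are checked are assumptions made for illustration.

```python
CORE_SETS = 8   # one 12-word core set per job (from the text)
VAC_AREAS = 5   # vector accumulator areas (from the text)


class ProgramAlarm(Exception):
    """Raised when a job cannot be scheduled; carries the alarm code."""

    def __init__(self, code):
        super().__init__(f"program alarm {code}")
        self.code = code


class Executive:
    """Hypothetical stand-in for a priority job scheduler."""

    def __init__(self):
        self.free_core_sets = CORE_SETS
        self.free_vac_areas = VAC_AREAS
        self.jobs = {}  # job name -> (priority, uses_vac)

    def schedule(self, name, priority, uses_vac=False):
        # Assumed order: a job needing a VAC area is refused first (1201),
        # then any job is refused if no core set is left (1202).
        if uses_vac and self.free_vac_areas == 0:
            raise ProgramAlarm(1201)
        if self.free_core_sets == 0:
            raise ProgramAlarm(1202)
        self.free_core_sets -= 1
        if uses_vac:
            self.free_vac_areas -= 1
        self.jobs[name] = (priority, uses_vac)

    def finish(self, name):
        # A completed job returns its core set (and VAC area) to the pool.
        _, uses_vac = self.jobs.pop(name)
        self.free_core_sets += 1
        if uses_vac:
            self.free_vac_areas += 1

    def next_job(self):
        # The highest-priority pending job runs next; None if idle.
        return max(self.jobs, key=lambda n: self.jobs[n][0], default=None)


def demo_overload():
    """Pile up jobs that never finish until the scheduler gives up."""
    ex = Executive()
    for i in range(CORE_SETS):
        ex.schedule(f"job{i}", priority=i)
    try:
        ex.schedule("one_too_many", priority=99)
    except ProgramAlarm as alarm:
        return alarm.code
    return None


print(demo_overload())
```

As long as jobs finish and free their slots faster than new ones arrive, the pools never run dry; the alarm only appears when something keeps work from completing on time.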
In this case, the computer would throw up its hands, issue a
program alarm, and restart. But this couldn't be a lengthy
reboot of the kind that users of personal computers with millions
of times the AGC's capacity tolerate half a century later. The
critical tasks in the AGC's software incorporated restart
protection, in which they would frequently checkpoint their
current state, permitting them to resume almost instantaneously
after a restart. Programmers estimated around 4% of the AGC's
program memory was devoted to restart protection, and some
questioned its worth. On Apollo 11, it would save the landing
mission.
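The checkpoint-and-resume idea behind restart protection can be sketched in a few lines. This is an illustration of the general technique under stated assumptions, not the AGC's mechanism: the `memory` dictionary stands in for the non-volatile read-write store, and the function names and the `overload_at` parameter are invented for the example.

```python
class Restart(Exception):
    """Signals that the computer abandoned all running work."""


def run_protected(phases, memory, overload_at=None):
    """Run a job phase by phase, checkpointing after each phase.

    phases:      list of functions, each taking and returning a value
    memory:      dict standing in for non-volatile memory; it survives
                 a restart and holds the last checkpoint
    overload_at: hypothetical test hook; restart before this phase index
    """
    index = memory.get("phase", 0)
    value = memory.get("value", 0)
    while index < len(phases):
        if index == overload_at:
            raise Restart()
        value = phases[index](value)
        index += 1
        # Checkpoint: record how far we got and the result so far.
        memory["phase"], memory["value"] = index, value
    return value


phases = [lambda v: v + 1, lambda v: v * 10, lambda v: v - 3]
memory = {}

try:
    run_protected(phases, memory, overload_at=2)
except Restart:
    pass  # work in progress is flushed, but the checkpoint remains

# After the restart the job resumes at phase 2 instead of starting over.
result = run_protected(phases, memory)
print(memory["phase"], result)
```

The cost is the extra bookkeeping after every phase, which is the kind of overhead the programmers were weighing when they questioned its worth.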
Shortly after the Lunar Module's landing radar locked onto
the lunar surface, Aldrin keyed in the code to monitor its
readings and immediately received a 1202 alarm: no core sets
to run a task; the AGC restarted. On the communications
link Armstrong called out “It's a 1202.” and
Aldrin confirmed “1202.” This was followed by
fifteen seconds of silence on the “air to ground”
loop, after which Armstrong broke in with “Give us a
reading on the 1202 Program alarm.” At this point,
neither the astronauts nor the support team in Houston
had any idea what a 1202 alarm was or what it might mean
for the mission. But the nefarious simulation supervisors
had cranked in such “impossible” alarms in
earlier training sessions, and controllers had developed
a rule that if an alarm was infrequent and the Lunar Module
appeared to be flying normally, it was not a reason to abort the
descent.
At the Instrumentation Laboratory in Cambridge, Massachusetts,
Don Eyles and his colleagues knew precisely what a 1202 was and
found it deeply disturbing. The AGC software had been
carefully designed to maintain a 10% safety margin under the
worst case conditions of a lunar landing, and 1202 alarms had
never occurred in any of their thousands of simulator runs using
the same AGC hardware, software, and sensors as Apollo 11's
Lunar Module. Don Eyles' analysis, in real time, just after a
second 1202 alarm occurred thirty seconds later, was:
Again our computations have been flushed and the LM is still flying. In Cambridge someone says, “Something is stealing time.” … Some dreadful thing is active in our computer and we do not know what it is or what it will do next. Unlike Garman [AGC support engineer for Mission Control] in Houston I know too much. If it were in my hands, I would call an abort.
As the Lunar Module passed 3000 feet, another alarm, this time a 1201 (VAC areas exhausted), flashed. This is another indication of overload, but of a different kind. Mission control immediately calls up “We're go. Same type. We're go.” Well, it wasn't the same type, but they decided to press on. Descending through 2000 feet, the DSKY (computer display and keyboard) goes blank and stays blank for ten agonising seconds. Seventeen seconds later another 1202 alarm, and a blank display for two seconds; Armstrong's heart rate reaches 150.
A total of five program alarms and resets had occurred in the final minutes of landing. But why? And could the computer be trusted to fly the return from the Moon's surface to rendezvous with the Command Module?
While the Lunar Module was still on the lunar surface, Instrumentation Laboratory engineer George Silver figured out what happened. During the landing, the Lunar Module's rendezvous radar (used only during return to the Command Module) was powered on and set to a position where its reference timing signal came from an internal clock rather than the AGC's master timing reference. If these clocks were in a worst case out of phase condition, the rendezvous radar would flood the AGC with what we used to call “nonsense interrupts” back in the day, at a rate of 12,800 per second, each consuming one 11.72 microsecond memory cycle. This imposed an additional load of more than 13% on the AGC, which pushed it over the edge and caused tasks deemed non-critical (such as updating the DSKY) not to be completed on time, resulting in the program alarms and restarts. The fix was simple: don't enable the rendezvous radar until you need it, and when you do, put the switch in the position that synchronises it with the AGC's clock.
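The load estimate is simply interrupt rate multiplied by the time each interrupt steals. The sketch below works it both ways; the 11.72 microsecond cycle time and the 13% figure come from the text, while the 12,800 per second worst-case rate and the function names are assumptions of this example.

```python
CYCLE_TIME_S = 11.72e-6  # one memory cycle, from the text


def stolen_fraction(interrupts_per_second, cycles_each=1):
    """Fraction of each second consumed by the spurious interrupts."""
    return interrupts_per_second * cycles_each * CYCLE_TIME_S


def rate_for_load(fraction, cycles_each=1):
    """Interrupt rate (per second) needed to consume a given fraction."""
    return fraction / (cycles_each * CYCLE_TIME_S)


# Assumed worst-case rate of 12,800 interrupts per second.
worst_case = stolen_fraction(12_800)
# Rate at which the stolen time reaches the 13% mentioned in the text.
threshold = rate_for_load(0.13)

print(f"load at 12,800/s: {worst_case:.1%}")
print(f"rate for 13% load: {threshold:,.0f}/s")
```

Against a design margin of about 10%, any steady drain of this size leaves no slack at all during the busiest phase of the landing.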
But the AGC had proved its excellence as a real-time system: in the face of unexpected and unknown external perturbations it had completed the mission flawlessly, while alerting its developers to a problem which required their attention.
The creativity of the AGC software developers and the merit of computer systems sufficiently simple that the small number of people who designed them completely understood every aspect of their operation were demonstrated on Apollo 14. As the Lunar Module was checked out prior to the landing, the astronauts in the spacecraft and Mission Control saw the abort signal come on, which was supposed to indicate the big Abort button on the control panel had been pushed. This button, if pressed during descent to the lunar surface, immediately aborted the landing attempt and initiated a return to lunar orbit. This was a “one and done” operation: no Microsoft-style “Do you really mean it?” tea ceremony before ending the mission. Tapping the switch made the signal come and go, and it was concluded the most likely cause was a piece of metal contamination floating around inside the switch and occasionally shorting the contacts. The abort signal caused no problems during lunar orbit, but if it should happen during descent, perhaps jostled by vibration from the descent engine, it would be disastrous: wrecking a mission costing hundreds of millions of dollars and, coming on the heels of Apollo 13's mission failure and narrow escape from disaster, possibly bringing an end to the Apollo lunar landing programme.
The Lunar Module AGC team, with Don Eyles as the lead, was faced with an immediate challenge: was there a way to patch the software to ignore the abort switch, protecting the landing, while still allowing an abort to be commanded, if necessary, from the computer keyboard (DSKY)? The answer to this was obvious and immediately apparent: no.
The landing software, like all AGC programs, ran from read-only rope memory which had been woven on the ground months before the mission and could not be changed in flight. But perhaps there was another way. Eyles and his colleagues dug into the program listing, traced the path through the logic, and cobbled together a procedure, then tested it in the simulator at the Instrumentation Laboratory. While the AGC's programming was fixed, the AGC operating system provided low-level commands which allowed the crew to examine and change bits in locations in the read-write memory. Eyles discovered that by setting the bit which indicated that an abort was already in progress, the abort switch would be ignored at the critical moments during the descent. As with all software hacks, this had other consequences requiring their own work-arounds, but by the time Apollo 14's Lunar Module emerged from behind the Moon on course for its landing, a complete procedure had been developed which was radioed up from Houston and worked perfectly, resulting in a flawless landing.
These and many other stories of the development and flight experience of the AGC lunar landing software are related here by the person who wrote most of it and supported every lunar landing mission as it happened. Where technical detail is required to understand what is happening, no punches are pulled, even to the level of bit-twiddling and hideously clever programming tricks such as using an overflow condition to skip over an EXTEND instruction, converting the following instruction from double precision to single precision, all in order to save around forty words of precious non-bank-switched memory.
In addition, this is a personal story, set in the context of the turbulent 1960s and early ’70s, of the author and other young people accomplishing things no humans had ever before attempted. It was a time when everybody was making it up as they went along, learning from experience, and improvising on the fly; a time when a person who had never written a line of computer code would write, as his first program, the code that would land men on the Moon, and when the creativity and hard work of individuals made all the difference.
Already, by the end of the Apollo project, the curtain was ringing down on this era. Even though a number of improvements had been developed for the LM AGC software which improved precision landing capability, reduced the workload on the astronauts, and increased robustness, none of these were incorporated in the software for the final three Apollo missions, LUMINARY 210, which was deemed “good enough” and the benefit of the changes not worth the risk and effort to test and incorporate them. Programmers seeking this kind of adventure today will not find it at NASA or its contractors, but instead in the innovative “New Space” and smallsat industries.
Posted at November 3, 2019 13:32