I’ve been reading Seibel’s Coders at Work, a series of interviews with various venerable hacker types. It’s fun!
Seibel often steers the conversation to a few clearly prepared questions. This seems to be a nice strategy: it realigns the interview and keeps it going when someone is banging on about his pet project, and it means we often get several opinions on a topic. One of the stock questions is, “What’s the worst bug you’ve had to deal with?” Here’s one of mine.
I was part of a small team writing the garbage collector for a new implementation of the Dylan language. Our garbage collector was soft-realtime (GC pauses in interactive programs were comparable to pauses due to page faults), generational, and copying, with a read barrier, implemented on stock hardware and on multiprocessor machines. It ran on 32-bit Intel, 64-bit Alpha, 32- or 64-bit MIPS, and SPARC (and probably more, I forget).
Our client was the compiler group. They were implementing the compiler for Dylan, and they had to integrate our garbage collector with their runtime. Lots of meetings about object layouts, header formats, hash tables, multithreaded stack allocation, that sort of thing. The compiler development was (finally) going through a bootstrapping phase. The compiler was being used to compile itself (previously the compiler was compiled using a version of Dylan that was “hosted” on a Common Lisp implementation). This was a troublesome phase. Compiling the compiler took several hours, so it was typically done overnight. There was a change in the semantics of multimethod dispatch (I think), and it was important to get the first self-hosted implementation of the compiler with the new semantics out to the people using the compiler, so they didn’t accidentally write code to the old semantics.
The overnight compile was usually done on the dual-processor SPARC. Unfortunately the overnight compile had stopped. At an assertion in the garbage collector. After running for several hours. It wasn’t really feasible to restart the compilation; I would be on the bus home before it hit the assertion again. And maybe it was due to some thread scheduling that only became a problem on a multiprocessor machine. The assert looked fine. It was checking that some value only increased monotonically. After some time thinking about it in our group, we realised that the assert was fine, unless we were on a multiprocessor machine. I think the assertion was outside of a monitored block or something like that. The rest of the code (we reasoned) was fine. So the code was right, but the assertion was wrong.
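To make “the code was right, but the assertion was wrong” a bit more concrete, here is a small Python sketch of how that can happen. This is not the original collector code, and I’m guessing at the shape of the bug: the names `Epoch`, `advance`, `check_monotonic` and `replay` are invented for illustration. The value itself only ever goes up, and it is updated under a lock; the check, however, runs outside the lock, so two threads can reach it in a different order from the one in which they obtained their values. The interleaving is replayed by hand rather than with real threads, so the result is the same every time.

```python
import threading


class Epoch:
    """A shared value that only ever increases; updates happen under a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0
        # Bookkeeping used only by the check below. It is NOT guarded
        # by the lock (this is the assumed flaw).
        self.last_checked = 0

    def advance(self):
        """Lock-protected increment; returns the new value."""
        with self._lock:
            self._value += 1
            return self._value

    def check_monotonic(self, observed):
        """The flawed check: compares against unguarded shared state."""
        ok = observed >= self.last_checked
        self.last_checked = max(self.last_checked, observed)
        return ok


def replay(order):
    """Two hypothetical threads each advance once, then run the check
    in the given order. Returns the result of each check."""
    epoch = Epoch()
    observed = {"A": epoch.advance(), "B": epoch.advance()}  # A sees 1, B sees 2
    return [epoch.check_monotonic(observed[name]) for name in order]


in_order = replay(["A", "B"])   # checks run in the order the values were taken
reordered = replay(["B", "A"])  # B reaches the check before A does

print(in_order)   # [True, True]
print(reordered)  # [True, False]
```

In the second replay nothing has gone backwards; thread A simply arrived at the check late, holding an older (but perfectly valid) value. Moving the check inside the locked region, or deleting it, makes the false alarm go away.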
We could edit the assertion out and ship a new release over to the compiler group. They would integrate the new release and it would be fine the next day. What about all those people in the US that wanted to use the new compiler when they woke later today? They couldn’t wait that long.
We patched the live binary. The overnight compile had stopped at this assertion and was showing it in the debugger. We looked up the NOP instruction in the architecture manuals and poked it in at the location where the assertion branched (or did it trap into the debugger? I forget). So now the assertion never fired. We let the debugger continue the process, and the compile ran fine with no garbage collection problems and finished a couple of hours later. Everyone could use the new compiler today!
This was not a problem that could have been fixed by using print.