My Favourite Bugs

2011-05-11

I’ve been reading Seibel’s Coders at Work, a series of interviews with various venerable hacker types. It’s fun!

Seibel often steers the conversation to a few clearly prepared questions. This seems to be a nice strategy: it realigns the interview and keeps it going when someone is banging on about his pet project, and it means we often get several opinions on a topic. One of the stock questions is, “What’s the worst bug you’ve had to deal with?” Here’s one of mine.

I was part of a small team writing the garbage collector for a new implementation of the Dylan language. Our garbage collector was a soft-realtime, generational copying collector with a read barrier (GC pauses in interactive programs were comparable to pauses due to page faults). It was implemented on stock hardware, including multiprocessor machines: 32-bit Intel, 64-bit Alpha, 32- or 64-bit MIPS, SPARC (and probably more, I forget).

Our client was the compiler group. They were implementing the compiler for Dylan, and they had to integrate our garbage collector with their runtime. Lots of meetings about object layouts, header formats, hash tables, multithreaded stack allocation, that sort of thing. The compiler development was (finally) going through a bootstrapping phase. The compiler was being used to compile itself (previously the compiler was compiled using a version of Dylan that was “hosted” on a Common Lisp implementation). This was a troublesome phase. Compiling the compiler was typically done overnight; it took several hours. There was a change in the semantics of multimethod dispatch (I think), and it was important to get the first self-hosted implementation of the compiler with the new semantics out to the people using the compiler, so they didn’t accidentally write code to the old semantics.

The overnight compile was usually done on the dual-processor SPARC. Unfortunately the overnight compile had stopped. At an assertion in the garbage collector. After running for several hours. It wasn’t really feasible to restart the compilation; I would be on the bus home before it hit the assertion again. And maybe it was due to some thread scheduling that only became a problem on a multiprocessor machine. The assert looked fine. It was checking that some value only increased monotonically. After some time thinking about it in our group, we realised that the assert was fine, unless we were on a multiprocessor machine. I think the assertion was outside of a monitored block or something like that. The rest of the code (we reasoned) was fine. So the code was right, but the assertion was wrong.
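What follows is only a guess at the shape of the problem, not the collector’s real code. It is a minimal Python sketch, assuming a hypothetical `Epoch` counter whose update is protected by a lock but whose monotonicity check runs outside it. All the names are invented, and the interleaving that a second processor makes possible is replayed by hand so that the result is deterministic.

```python
import threading


class Epoch:
    """Hypothetical shared counter; every name here is invented for illustration."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0         # only ever incremented, and only under the lock
        self.last_checked = 0  # read and written by the check, with no lock held

    def advance(self):
        # Inside the monitored block: the update itself is safe.
        with self._lock:
            self.value += 1
            return self.value

    def check(self, observed):
        # Outside the monitored block: this stands in for the assertion.
        ok = observed >= self.last_checked
        self.last_checked = max(self.last_checked, observed)
        return ok


epoch = Epoch()
a = epoch.advance()     # "processor A" takes value 1
b = epoch.advance()     # "processor B" takes value 2
b_ok = epoch.check(b)   # B reaches the check first and passes
a_ok = epoch.check(a)   # A arrives late with its older value and fails
```

In this replay `epoch.value` never decreases, yet `a_ok` comes out false: the counter is right and the check is wrong.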

We could edit the assertion out and ship a new release over to the compiler group. They would integrate the new release and the overnight compile would be fine the next day. But what about all those people in the US who wanted to use the new compiler when they woke later today? They couldn’t wait that long.

We patched the live binary. The overnight compile had stopped at this assertion and was showing it in the debugger. We looked up the NOP instruction in the architecture manuals and poked it in at the location where the assertion branched (or did it trap into the debugger? I forget). So now the assertion never fired. We let the debugger continue the process, and the compile ran fine with no garbage collection problems and finished a couple of hours later. Everyone could use the new compiler today!
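As a rough illustration of the patch, here is a minimal Python sketch that overwrites one instruction word in a byte buffer standing in for the process’s code. It assumes the machine was the SPARC, where instructions are 4 bytes, big-endian, and the canonical NOP (`sethi 0, %g0`) encodes as `0x01000000`. The function name `patch_nop` and the placeholder instruction words are made up; the real patch was made in the live process, not in a buffer like this.

```python
import struct

SPARC_NOP = 0x01000000  # sethi 0, %g0


def patch_nop(code, offset):
    """Replace the 4-byte instruction at `offset` with a NOP; return the old word."""
    if offset < 0 or offset % 4 != 0 or offset + 4 > len(code):
        raise ValueError("offset must be a word-aligned position inside the code")
    (old,) = struct.unpack_from(">I", code, offset)
    struct.pack_into(">I", code, offset, SPARC_NOP)
    return old


# Three placeholder words, not real SPARC encodings; the middle one stands in
# for the branch (or trap) taken when the assertion fails.
code = bytearray(struct.pack(">3I", 0x11111111, 0x22222222, 0x33333333))
old = patch_nop(code, 4)
patched = struct.unpack(">3I", code)
```

Returning the old word is a small safety net: it lets you confirm you overwrote the instruction you meant to, and put it back if you did not.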

This was not a problem that could have been fixed by using print.

9 Responses to “My Favourite Bugs”

  1. Ted Lemon Says:

    Is that your favorite bug, or your favorite hack? It’s definitely a truly moby hack!

  2. LongSteve Says:

    I’ve never actually encountered a genuine compiler bug. A compiler bug that occurred while compiling a compiler is epic though :-)

  3. Jason Riedy Says:

    “So the code was right, but the assertion was wrong.” When this occurs in my code, this means my mental model was slightly wrong. Often there are other places where I applied that incorrect model. Assertions are not just about *code* but also expectations. It worked for you (F-D?), but a similar mentality among consumers of my code has hurt…

  4. Nick Barnes Says:

    Garbage collector bugs are among my favourites. Generally speaking – especially with GCs which are less comprehensively sanity-checking than the one DRJ describes – a bug in a copying GC takes the form of a failure to locate and correct a pointer, when moving an object. Such a bug may never manifest (if the program never follows the pointer; commonplace in language implementations with insane allocation rates, like most ML compilers). But if it does, it will be at a later point in time, when the program attempts to follow the pointer, and the entire state of the collection which failed to update it has been entirely lost.

    Also among my favourite bugs: a failure to flush an instruction cache, when moving some code (so the instructions which cause the failure are no longer in main memory, only in the I-cache).

    And the time our bootstrap system failed to have a correct representation of 10.0. Naturally, this bogus representation propagated through several bootstrap generations (because a commonplace and naive approach to parsing a floating-point constant involves multiplying by the compiler’s own notion of 10.0).

    Then there’s the time I got our delivered application size for “Hello, World!” down from 32MB to 4K (or some such) over the course of a couple of days.

    • drj11 Says:

      I intend at some point to expand a little on your “10.0” bug, with reference to the Kernighan trojan paper.

      • Nick Barnes Says:

        Yes, it’s closely related of course. In fact, doesn’t Kernighan mention it? I expect everyone who has bootstrapped a compiler has run into floating-point bootstrapping issues such as this. The 10.0 one was not the only one. When we cross-compiled, bootstrapping our first little-endian architecture, there was something similar in the floating-point.
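The “10.0” bug in this thread can be sketched in a few lines of Python. This is a minimal illustration, not the code of any real compiler: `parse_decimal` is a made-up naive parser that takes the host compiler’s notion of ten as a parameter, and `bogus_ten` is an arbitrary wrong value chosen for the example.

```python
def parse_decimal(text, ten):
    """Naively parse a literal like '10.0', using `ten` as the compiler's own 10.0."""
    whole, _, frac = text.partition(".")
    value = 0.0
    for digit in whole:
        value = value * ten + int(digit)
    scale = 1.0
    for digit in frac:
        scale = scale / ten
        value = value + int(digit) * scale
    return value


bogus_ten = 10.000001  # arbitrary wrong value, standing in for the bad representation

# Each generation compiles the literal "10.0" in its own source using the
# previous generation's idea of ten.
generations = [bogus_ten]
for _ in range(3):
    generations.append(parse_decimal("10.0", generations[-1]))
```

Parsing “10.0” with a wrong ten yields exactly that wrong ten, so every entry in `generations` equals `bogus_ten`. The error is a fixed point of the bootstrap and never corrects itself.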

  5. Gareth Rees Says:

    That’s an entertaining description of an interesting hack. It inspired me to write up one of my own favourite bugs.

