Archive for May, 2011

My Favourite Bugs, Part 2


Another one.

Two of us were working at the client’s site on the software for what was essentially a compact static robot, responsible for moving small items from a hopper to a delivery point. There was approximately one instance of the hardware that we were writing software for, and it sat on a bench between our two workstations. Most of the mechanical bits were in a more or less finished state. I was particularly impressed with the piston that generated a partial vacuum so that items could be picked by a moving arm fitted with suction cups. Just one or two gears were made of prototyping plastic, and because of a gearing problem the belt didn’t move at the speed the spec said it should. But you know, typical prototype hardware. The electronics were a mixture of off-the-shelf dev kits for 8-bit embedded micros, mini custom circuit boards for novel sensors, and lovingly hand-soldered discrete parts. Add to that the fact that as a software guy I didn’t really understand the importance of grounding, and La Machine wasn’t always completely reliable (my colleague had just lent me Tracy Kidder’s The Soul of a New Machine).

Various optical/IR sensors kept track of the items as they moved through the machine; various other sensors kept track of motor positions and/or speeds. There was a slightly hairy state machine (documented using OmniGraffle) to keep track of it all. The target pick rate was 5 items per second, which is one pick every 200ms, and as it took more than 200ms for an item to go from the hopper to the point where it left the machine, there could be several items “in-flight” at any one time. And of course, picking an item was never completely reliable, so the sensors were used to track the items and determine whether a retry was required.
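A rough sketch of that arithmetic, in Python. The pick rate is the one quoted above; the 450ms transit time and the function name are invented for illustration, not measurements or code from the machine.

```python
import math

PICK_RATE_PER_S = 5                        # target pick rate quoted above
PICK_PERIOD_MS = 1000 // PICK_RATE_PER_S   # one pick every 200 ms


def items_in_flight(transit_ms: int, period_ms: int = PICK_PERIOD_MS) -> int:
    """Upper bound on items inside the machine at once.

    transit_ms is the hopper-to-exit time. It is a hypothetical input:
    the only thing known about the real figure is that it exceeded 200 ms.
    """
    return math.ceil(transit_ms / period_ms)


# Hypothetical 450 ms transit: the items picked at t=0, 200 and 400 ms
# are all still inside at t=400 ms.
print(items_in_flight(450))   # 3
```

Any transit time over 200ms gives at least two items in flight, which is why the software had to track each item individually rather than assume one at a time.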

This day it seemed to be working fine, except for some reason the software was reporting that items were failing to be delivered, when in fact I could plainly see that items were popping out of the top of the machine. This was causing the machine to stop prematurely, as it would, sensibly, stop picking items if it thought that a picked item was stuck inside the machine somewhere. Up to this point it had basically been working fine; it had been working the same morning. I was sat thinking about this and investigating somewhat. I’d even checked the last thing I’d changed. So I called my colleague over (just at the next desk) and he came over to look while I demonstrated the problem. It worked. There was no problem. Flakies (there’s a memorable part of Kidder’s book where one of the engineers, in helping a more junior engineer sort out a problem with a wire-wrap memory board, grabs the whole frame and shakes it, claiming that it’s probably just flakies; of course it works after that).

So that’s okay then. But when I try it again, it’s not working. Colleague comes over. Works. I work at it on my own. Doesn’t work.

It turned out that it was quite a sunny day. The sun reached the window in the afternoon. Sunlight was falling on the machine near the optical sensor and presumably bouncing around enough to prevent the sensor from registering the occlusion as the item went past. When my colleague came over, he was standing over the machine and his shadow blocked the sunlight. This wouldn’t be a problem in the production hardware, as it was all in a box (and hence, dark inside).

I constructed an optical shield and installed it on La Machine (a post-it note stuck to the side).

This was not a problem that could be fixed using print. We did have print via the serial port, but printing more than one character was hazardous because the time taken to transmit characters on the serial line would interfere with the realtime operation of the rest of the software (when items are being picked every 200ms, every millisecond counts).
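To give a feel for the cost of a print, here is a small Python sketch. The post does not record the line speed, so 9600 baud with 8N1 framing (10 bits on the wire per character) is purely an assumption, as is the idea that the transmit blocks the rest of the software while it runs.

```python
def tx_time_ms(n_chars: int, baud: int = 9600, bits_per_char: int = 10) -> float:
    """Time on the wire for n_chars.

    Assumes 8N1 framing: 1 start bit + 8 data bits + 1 stop bit.
    Both defaults are hypothetical; the real port settings are not known.
    """
    return n_chars * bits_per_char * 1000 / baud


one_char = tx_time_ms(1)                     # a little over 1 ms
message = tx_time_ms(len("item 3 stuck\n"))  # 13 characters, about 13.5 ms

print(f"{one_char:.2f} ms per character")
print(f"{message:.1f} ms for a short message")
print(f"{100 * message / 200:.1f}% of a 200 ms pick period")
```

Under those assumptions a single character is tolerable, but even a short diagnostic line eats a noticeable slice of each pick period.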

My Favourite Bugs


I’ve been reading Seibel’s Coders at Work, a series of interviews with various venerable hacker types. It’s fun!

Seibel often steers the conversation to a few clearly prepared questions. This seems to be a nice strategy: it realigns the interview and keeps it going when someone is banging on about his pet project, and it means we often get several opinions on a topic. One of the stock questions is, “what’s the worst bug you’ve had to deal with?”. Here’s one of mine.

I was part of a small team writing the garbage collector for a new implementation of the Dylan language. Our garbage collector was soft-realtime (pauses for GC for interactive programs were comparable to pauses due to page faults), generational copying, with a read-barrier, implemented on stock hardware and on multiprocessor machines: 32-bit Intel, 64-bit Alpha, 32- or 64-bit MIPS, SPARC (and probably more, I forget).

Our client was the compiler group. They were implementing the compiler for Dylan, and they had to integrate our garbage collector with their runtime. Lots of meetings about object layouts, header formats, hash tables, multithreaded stack allocation, that sort of thing. The compiler development was (finally) going through a bootstrapping phase. The compiler was being used to compile itself (previously the compiler was compiled using a version of Dylan that was “hosted” on a Common Lisp implementation). This was a troublesome phase. Compiling the compiler was typically done overnight, as it took several hours. There was a change in the semantics of multimethod dispatch (I think), and it was important to get the first self-hosted implementation of the compiler with the new semantics out to the people using the compiler, so they didn’t accidentally write code to the old semantics.

The overnight compile was usually done on the dual processor SPARC. Unfortunately the overnight compile had stopped. At an assertion in the garbage collector. After running for several hours. It wasn’t really feasible to restart the compilation; I would be on the bus home before it hit the assertion again. And maybe it was due to some thread scheduling that only became a problem on a multiprocessor machine. The assert looked fine. It was checking that some value only increased monotonically. After some time thinking about it in our group, we realised that the assert was fine, unless we were on a multiprocessor machine. I think the assertion was outside of a monitored block or something like that. The rest of the code (we reasoned) was fine. So the code was right, but the assertion was wrong.
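One way an assertion like that can be wrong while the code it checks is right can be shown with a toy model in Python. This is not the collector’s code, and the details of the real assertion are only half remembered above: the names, the two-step structure and both schedules are invented, and the interleavings are replayed by hand rather than with real threads.

```python
def assertion_fires(schedule):
    """Replay an interleaving of workers; return True if the check would fail.

    Each worker performs two steps:
      "read"  - inside the monitor: bump the shared value, keep a private copy
      "check" - outside the monitor: compare the copy with the last value checked
    """
    value = 0          # shared value; it only ever increases
    last_checked = 0   # what the (hypothetical) assertion compares against
    copy = {}          # per-worker private copies
    for worker, step in schedule:
        if step == "read":
            value += 1
            copy[worker] = value
        elif step == "check":
            if copy[worker] < last_checked:
                return True   # fires, although value never decreased
            last_checked = copy[worker]
    return False


# One processor: each worker finishes both steps before the other starts.
serial = [("A", "read"), ("A", "check"), ("B", "read"), ("B", "check")]
# Two processors: B overtakes A between A's read and A's check.
racy = [("A", "read"), ("B", "read"), ("B", "check"), ("A", "check")]

print(assertion_fires(serial))   # False
print(assertion_fires(racy))     # True
```

In the second schedule the shared value goes 1, 2 and never decreases; it is only the unprotected check that sees the values out of order.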

We could edit the assertion out and ship a new release over to the compiler group. They would integrate the new release and it would be fine the next day. What about all those people in the US that wanted to use the new compiler when they woke later today? They couldn’t wait that long.

We patched the live binary. The overnight compile had stopped at this assertion and was showing it in the debugger. We looked up the NOP instruction in the architecture manuals and poked it in at the location where the assertion branched (or did it trap into the debugger? I forget). So now the assertion never fired. We let the debugger continue the process, and the compile ran fine with no garbage collection problems and finished a couple of hours later. Everyone could use the new compiler today!
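For flavour, the poke itself amounts to overwriting one instruction word. The Python sketch below assumes the machine was the SPARC mentioned earlier, where instructions are fixed-width 32-bit big-endian words and the canonical NOP is the encoding 0x01000000 (sethi 0, %g0). The function name, the byte buffer and the words in it are placeholders, not anything from the real binary or debugger.

```python
import struct

SPARC_NOP = 0x01000000   # "sethi 0, %g0", the canonical SPARC no-op
INSN_SIZE = 4            # SPARC instructions are fixed-width 32-bit words


def poke_nop(image: bytearray, offset: int) -> int:
    """Overwrite the instruction at offset with a NOP; return the old word."""
    if offset % INSN_SIZE != 0:
        raise ValueError("SPARC instructions are 4-byte aligned")
    (old,) = struct.unpack_from(">I", image, offset)   # big-endian word
    struct.pack_into(">I", image, offset, SPARC_NOP)
    return old


# Placeholder words standing in for code around the assertion;
# these are not real instruction encodings.
image = bytearray(struct.pack(">3I", 0x11111111, 0x22222222, 0x33333333))
old = poke_nop(image, 4)

print(hex(old))            # 0x22222222
print(image[4:8].hex())    # 01000000
```

Returning the old word means the patch can be undone by writing it back, which is a sensible habit when editing a process that took hours to get into its current state.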

This was not a problem that could have been fixed by using print.