My Favourite Bugs

2011-05-11

I’ve been reading Seibel‘s Coders at Work, a series of interviews with various venerable hacker types. It’s fun!

Seibel often steers the conversation to a few clearly prepared questions. This seems to be a nice strategy, it realigns the interview and keeps it going when someone is banging on about his pet project, and it means we often get several opinions on a topic. One of the stock questions is, “what’s the worst bug you’ve had to deal with?”. Here’s one of mine.

I was part of a small team writing the garbage collector for a new implementation of the Dylan language. Our garbage collector was soft-realtime (pauses for GC for interactive programs were comparable to pauses due to page faults), generational copying, with a read-barrier, implemented on stock hardware and on multiprocessor machines. 32-bit Intel, 64-bit Alpha, 32- or 64-bit MIPS, SPARC (and probably more, I forget).

Our client were the compiler group. They were implementing the compiler for Dylan, and they had to integrate our garbage collector with their runtime. Lots of meetings about object layouts, header formats, hash tables, multithreaded stack allocation, that sort of thing. The compiler development was (finally) going through a bootstrapping phase. The compiler was being used to compile itself (previously the compiler was compiled using a version of Dylan that was “hosted” on a Common Lisp implementation). This was a troublesome phase. Compiling the compiler was typically done overnight, it took several hours. There was a change in the semantics of multimethod dispatch (I think), and it was important to get the first self-hosted implementation of the compiler with the new semantics out to the people using the compiler, so they didn’t accidentally write code to the old semantics.

The overnight compile was usually done on the dual processor SPARC. Unfortunately the overnight compile had stopped. At an assertion in the garbage collector. After running for several hours. It wasn’t really feasible to restart the compilation, I would be on the bus home before it hit the assertion again. And maybe it was due to some thread scheduling that only became a problem on a multiprocessor machine. The assert looked fine. It was checking that some value only increased monotonically. After some time thinking about in our group, we realised that the assert was fine, unless we were on a multiprocessor machine. I think the assertion was outside of a monitored block or something like that. The rest of the code (we reasoned) was fine. So the code was right, but the assertion was wrong.

We could edit the assertion out and ship a new release over to the compiler group. They would integrate the new release and it would be fine the next day. What about all those people in the US that wanted to use the new compiler when they woke later today? They couldn’t wait that long.

We patched the live binary. The overnight compile had stopped at this assertion and was showing it in the debugger. We looked up the NOP instruction in the architecture manuals and poked it in at the location where the assertion branched (or did it trap into the debugger? I forget). So now the assertion never fired. We let the debugger continue the process, and the compile ran fine with no garbage collection problems and finished a couple of hours later. Everyone could use the new compiler today!

This was not a problem that could have been fixed by using print.


The importance of AV

2011-04-04

As you may know here in the UK we’re about to have a referendum to change the way we vote for our elected MPs. It means I have something important to talk about.

Kerning.

Kerning is the “negative space” between a pair of letters. Where the shapes of two adjacent letterforms complement each other, there is a danger of leaving too large a space between them if you space out the letters too rigidly. In the logo above, the A and the V in the first line have been brought closer together compared to the second line. I hope you can appreciate that “Yes to AV” has the better letter-spacing. And if you can’t, GET OFF MY BLOG!

This is the launch of the “Yes to AV; No to A V” campaign. I even have a logo with totally web 2.0 compliant round corners in a shade that isn’t blue.

Any half decent package that allows you to type in text should allow you to adjust the kerning, the spacing between the letters. In Mac OS X the feature has been built into the standard text widget, and hence incorporated into every application that plays nicely, since the dawn of time. I did the graphic in Inkscape which, being an X11 app, does not play nicely, but it didn’t matter. Most of the time you don’t need to hand kern the spacing between individual letters, because the font files that describe the letterforms also come with kerning tables, to adjust the spacing between pairs of letters that comonly need attention. “AV” is not as common as “Ye” or “Wo” but i’d still expect it to be kerned by the font software.

And so it is. It turns out that in my installation of Inkscape I have two Times fonts that I can choose in the pull-down font selector thingy. One called Times New Roman, one called Times.

So in making my “Yes to AV; No to A V” logo, all I had to do was choose Times New Roman for the “Yes to AV” line, and choose Times for the “No to A V” line. I expect Times New Roman is the system font exposed as an X11 font, and Times is some craptastic knockoff supplied with Inkscape that obviously has second-rate kerning tables. Thanks for that.

This article was of course inspired by the leader in The Observer (you will recall the dark ages of the 20th century when people read the news in print; The Observer is a notable Sunday publication from that era):

Weird. The appalling typography almost distracted me from reading the article (another win for that new-fangled web technology). Even weirder: they get their typography sorted on the opposite page.

So the message is clear: “Yes to AV; No to A V”.


Fun with Polarization

2010-10-13

I went to see Toy Story 3 in 3D (great film, but the 3D was not worth it); so I have a pair of those cheap plastic “3D” glasses.

They’re actually polarising lenses (er, filter, not lens, as the picture is not magnified or minified, but whatever). You can use them to Do Science at home:

Birefringence

Above is a picture of an iPod taken through one of the lenses (of the 3D glasses). The coloured pattern is, I think, due to birefringence, but I’m not competent to explain it entirely.

Same shot, but using the other lens:

So we see that the lens can be used to see patterns that are otherwise not visible. In this case, we can see stress patterns in the plastic (I think). Looking a bit more carefully at the iPod pictures, we can see that the patterns on the iPod are different for the two lenses. Thus, there are two different sorts of light that each lens passes.

The next picture involves a mirror:

I am holding the camera behind one of the lenses and shooting a picture into the mirror so I can see myself and the glasses and camera in the mirror. [edit: it's more fun to put the glasses on and look into a mirror and close one eye, but I can't show you a picture of that.]

Note that from the viewer side both of the lenses in the glasses are transparent. But as seen in the mirror, one of the lenses is opaque. Curiously the lens that is opaque in the mirror, is the one in front of the camera. How did the picture get taken!? Clearly light is able to enter the camera having passed through the right-hand lens. But the camera, behind the right-hand lens, is not visible. So light leaving the body of the camera through the right-hand lens cannot pass through the lens again when reflected in the mirror:

My explanation of this is that the light acquires a property as it passes through the lens, and the mirror changes that property so that light with the new property cannot pass through the lens. The property is the handedness of circular polarisation. The right-hand lens filters the light so it contains right-circular polarised light (say); the mirror reflects it into left-circular polarised light which cannot pass back through the lens (and into the camera sensor). Note that light that passes through the left lens and is reflected in the mirror, can pass through the right-lens.

[edit: however, I have a problem with this explanation, as it does not explain what happens when I get my laser pointer out...]

3D films work by displaying one frame through a left-circular polarising filter, and the next through a right-circular polarising filter, and alternating. A reflective screen with the right properties will maintain the separate polarisation of alternate frames. Circular polarising filters in the glasses are insensitive to viewing angle, so it’s a superior option to linear polarisation.


Colouring Doubt’s Flag

2010-09-24

Judith Curry is keen to frame doubt in the form of an italian flag. Specifically with reference to this statement from IPCC WG1 Summary for Policy Makers:

Most of the observed increase in global average temperatures since the mid-20th century is very likely due to the observed increase in anthropogenic greenhouse gas concentrations.

Curry’s flag interpretation of this statement is that we could colour the flag 5% white (uncommitted belief), 67% green (anthropogenic forcing), 28% red (natural variability). A minor quibble: the opposite of “due to increase in anthropogenic GHG” is not “natural variability” as that excludes other anthropogenic activies such as sulphate emissions and secondary effects like ozone increases due to Montreal protocol. Anyway, her flag would look like this (if she drew it):

Does that seem right, can we be almost certain that there is only 5% wiggle room for doubt? Also, if I say that 70% of the variation is anthropogenic that doesn’t mean the rest (or almost all the rest) is natural, it just means I don’t know. I interpret the IPCC statement as meaning that there are a wide range of supportable beliefs about the anthropogenic cause of 20th century warming, but 95% (ish) of those will have more than 50% of the flag coloured green. Amongst the population of possible flags is this one:

Note that this flag already represents quite an extreme position with respect to the IPCC statement, because whilst it’s compatible with the IPCC statement, only 5% of the flags have a smaller green area than this. Here’s a more median position:

How can we represent the range of beliefs that are compatible with the IPCC statement. Like this?


Python dictionary: iterating and adding

2010-03-05

On one of my earlier articles about Python dictionaries Jay asks (I paraphrase): What if you want to add items to a dictionary whilst iterating? And what if you want those items to appear in the iteration? Python’s standard dictionary iterators won’t help you.

Here’s a plan:

  1. keep a set, visited of keys that you have already visited;
  2. get the set of keys from the dictionary and subtract visited, leaving you with a set of keys not yet visited, todo;
  3. visit each of the keys in todo;
  4. now we have visited each of the keys in todo, add those keys to visited;
  5. during the iteration the set of keys in the dictionary may have changed, so repeat from (2), stopping when todo is empty.
def iterall(adict):
    """Iterate over all the items in a dictionary,
    including ones which are added during the iteration.
    """
    visited = set()
    while True:
        todo = set(adict.keys()) - visited
        if not todo:
            break
        for k in todo:
            if k in adict:
                yield k, adict[k]
        visited |= todo

Here it is in use:

>>> d=dict(a=1,b=2)
>>> for k,v in iterall(d):
...     print k,v
...     d['c'+k[0]] = 'new'
... 
a 1
b 2
cb new
ca new
cc new

Python’s set datatype makes this easy to program and it has reasonable performance.

Obligatory Python whinge: The function implemented above, iterall, is effectively a new method that I’ve implemented for the dict class. But I can’t add a new method to a class that someone else implemented. Unlike, say, Common Lisp, Dylan, Smalltalk, and Objective-C, where I can.


Python: a string indexing trick

2010-02-23

Let’s say you want to extract all the lines of a file where the character at index 5 is not a space:

with open('sequences.gb') as f:
  for line in f:
    if line[5] != ' ':
      print line[:-1]

Problem: short lines. line[5] might not be a valid index. Exception.

You might consider line[5:6]. Normally this is the same as line[5], but when line is short then instead of raising an Exception it evaluates to the empty string, "".

In this case it will result in short lines being selected and printed. If the test was if line[5:6] in string.letters then it will result in short lines being rejected. But that changes the semantics of the test for the lines that are long enough. If I had written if ' '.startswith(line[5:6]) then I keep the same test semantics for long lines and I reject short lines. It’s weird and not clear though.

Sometimes this trick may be appropriate, and you may be able to tweak your test to accommodate it. Don’t let the clarity slide. There is a great deal of merit to: if len(line) > 5 and line[5] != ' '.


Python: Getting FASTA with itertools.groupby

2010-02-22

I’d like to introduce you to my new best friend: itertools.groupby.

Let’s say you want to do some processing of the genes in Danio rerio (zebrafish). Here’s the Ensembl project’s FASTA file containing all the peptide sequences: ftp://ftp.ensembl.org/pub/release-56/fasta/danio_rerio/pep/Danio_rerio.Zv8.56.pep.all.fa.gz.

(After I’ve decompressed it) the file looks like this:

>ENSDARP00000100686 pep:known scaffold:Zv8:Zv8_NA5235:758:1999:1 gene:ENSDARG00000076631 transcript:ENSDART00000111668
LKSHSNTQPYLTVDSSGKYLRSQLHSQWHPINEKFRCEVCGRCFHSNTNLLVHYAVHTGE
KPYKCSFCEKGFSQKGNLQAHERIHRGEKPFSCVICGRSFTQKVCLRNHERIHRGEKPFT
CMTCGKGFTQKVTLQQHLAVHDKTTKRLCITCSSTFRTSSAKAVHKPIQSSNMP
>ENSDARP00000094838 pep:known scaffold:Zv8:Zv8_NA8956:1000:1382:1 gene:ENSDARG00000070696 transcript:ENSDART00000104063
FLISSVDNSEYMRNGDFLPTRLQAQQDAVNIICHSKTRSNPENNVGLITMANNCEVLTTL
TPDTGRILSKLHAVQPRGVISFCTGIRVAHV
>ENSDARP00000102657 pep:known scaffold:Zv8:Zv8_NA8774:1123:1377:-1 gene:ENSDARG00000078974 transcript:ENSDART00000114102
MASAPPSSPHNRSRSEDEEDPVDTMISKTGCAELHYTLLDCMAEHQDWRKCQTEVLKFKE
CMSAYQNTRKEQLLKQKTSATESV
[...]

It’s a list of peptide sequences, with each sequence consisting of a header line beginning with ‘>’:

>ENSDARP00000100686 pep:known

The header contains identification information (and is probably too long for my blog); following the header is a number of lines of peptide data (using IUPAC/IUBMB amino acid abbreviations):

LKSHSNTQPYLTVDSSGKYLRSQLHSQWHP

The job of itertools.groupby(iterable, key=fn) is to group together the items in iterable that have the same key. For example if you have a sequence of e-mail addresses, and you use a key function that extracts the domain part of the e-mail address (def domainkey(m): return m.partition('@')[-1]) then each group that you get out of itertools.groupby will have e-mails that share the domain part after the ‘@’.

It is usual for the data sequence passed to itertools.groupby to be sorted, but it doesn’t have to be, and itertools.groupby won’t sort it. This is useful. For example, returning to our FASTA file, if we have a key function that simply identifies a line as a header or not:

def isheader(line):
    return line[0] == '>'

Then when we use this in a call to itertools.groupby the groups will simply alternate between a group of header lines, and a group of non-header lines. Like this (The printout is truncated on the right so that I had some space to write in):

(Note that all of the “header” groups are just one line long, since a sequence has only one header line)

Since the groups just alternate header, non-header, the number of groups is twice the number of sequences:

import itertools
with open('Danio_rerio.Zv8.56.pep.all.fa') as f:
  print len(list(itertools.groupby(f, key=isheader)))//2

The above code tells me the file has 28,630 peptide sequences. Of course, using len and list is a bit naughty; len doesn’t work on iterables so I had to convert it to a list, not good (on memory usage). Could’ve used sum, a generator expression, and a bit of Iverson’s Convention: sum(g for g,_ in itertools.groupby(open('Danio_rerio.Zv8.56.pep.all.fa'), key=isheader)).

The useful thing about itertools.groupby is that it lets you deal with the header and the peptide sequence as a single group, without having to worry about detecting the start of the next sequence or handling end-of-file correctly.

Say you wanted to make a dictionary that mapped from Ensembl identifiers to the peptide sequence itself:

def aspairs(file):
  for header,group in itertools.groupby(f, isheader):
    if header:
      line = group.next()
      ensembl_id = line[1:].split()[0]
    else:
      sequence = ''.join(line.strip() for line in group)
      yield ensembl_id, sequence

with open('Danio_rerio.Zv8.56.pep.all.fa') as f:
  d = dict(aspairs(f))

Notice how ensembl_id is set when a header line is read, but only used on the next spin through the loop when the sequence group is processed (in the else).

Filtering is easy too. Well, for one thing you can write a filter/comprehension on the output of aspairs, but say you didn’t want to do that, and you wanted to filter in the loop itself. You need another variable, set when the header is processed, that says whether we want to process (and yield) the sequence. Say we’re interested in only the novel genes (the second word of the header is “pep:novel”):

def aspairsnovel(file):
  for header,group in itertools.groupby(f, isheader):
    if header:
      line =group.next()
      word = line[1:].split()
      ensembl_id = word[0]
      isnovel = word[1] == 'pep:novel'
    elif isnovel:
      sequence = ''.join(line.strip() for line in group)
      yield ensembl_id, sequence

Why do I mention using itertools.groupby at all? Because if you try processing a FASTA file using a for loop it gets a bit awkward:

def aspairs(file):
  seq = ''
  for line in file:
    if isheader(line):
      if seq:
        yield ensembl_id, seq
      ensembl_id = line[1:].split()[0]
      seq = ''
    else:
      seq += line.strip()
  yield ensembl_id, seq

The code to emit the (id,sequence) pair occurs when a new header line is detected (inside the if isheader, above), and it has to have a special case for the first header line in the file (because there is no sequence to emit). That code to emit the (id,sequence) pair also has to appear after the for loop to cope with the last sequence in the file. The variable used to accumulate the sequence data, seq, has to be reset in two different places: once before the loop begins, and then inside the loop every time we see a header line.

Awkward.

Using a while loop is even worse because now you have to detect end-of-file as well.

The idea of using itertools.groupby to parse files like this came out of a discussion between me and Nick Barnes while we were coding on Clear Climate Code.

Step 2 of the GISTEMP algorithm begins by reading the Ts.txt which looks like this:

  495-11424037187400101286BEAVER MINES,AL                     
1935  -80    4  -50  -16   69  113  153  136  107   38 9999  -11
1936  -89 -240  -37   28  116  131  172  145  107   73   16  -73
1937 -183 -114  -24   45   92  120  153  138  103   76  -21  -53
1938  -28 -110   -4   30   84  133  158  132  147   76  -21  -38
[...more records here...]
  495-11404037187400201189PINCHER CREEK A,AL                  
1979 9999 9999 9999 9999 9999  145  168  164  139   80  -11  -13
1980 -113  -49  -27   76  110  129  166  131  113   82   14  -73
1981   -3  -24   24   57   92  115  155  177  126   55   32  -54
[...more records here...]

You can see the structural similarity to FASTA format. In this case we have sequences of temperature records with each block starting with a header line that has information about the station.

Nick had the idea of using itertools.groupby and I had the idea of using key=lambda line: line[0]==' ' (which is essentially the version of isheader that you need for Ts.txt files). If you like you can compare read_text from step2.py with its counterpart, text_to_binary, from the earlier version. Although you should note that this is not strictly a fair comparison because the earlier version is very close the original Fortran, and there are more changes than just having the idea of using itertools.groupby.

The application to FASTA files was inspired by my reading Mitchell Model’s «Bioinformatics Programming Using Python».

Bonus exercise (for the diligent reader): How bad is it to convert the output of itertools.groupby to a list, like I do in my counting example: len(list(itertools.groupby(f, key=isheader)))//2 ?


Screw the planet!

2010-02-11

The earth is getting warmer. Ecosystems are changing. Food webs around the world are upset. We are in the middle of mass extinction.

For years the green lobby have been complaining about this, finally over the last decade their voice can be heard as global warming becomes a political talking point. We can save the Earth.

Screw that! The greens are getting way too much political capital out of global warming. Sending your money to Friends of the Earth, World Wildlife Fund for Nature, the RSPB, and so on, will not solve global warming. They are not in the game.

The planet is not under threat. Our life on this planet is under threat. Runaway global warming will be bad. For any metazoan. Everything larger than than a hyrax will be wiped out. But life will go on, the Earth will be claimed once more by the Archaea and it will be business as usual really. Nature’s brief experimental dalliance with multi-cellular life will have ended. She will conclude that “further research is necessary”.

So the debate around global warming is not about saving the Earth, it is about saving our way of life on the Earth. And let’s get this straight, once we’ve “solved” global warming, we’ll have put a nuclear power plant on every (100 Km or so of) coastline, we’ll have replaced inefficient sheep pastures with huge plantations of biofuel crops, we’ll have peppered the entire countryside with wind turbines, and glazed vast deserts with solar PV farms.

We will not have saved the polar bear, the poison arrow tree frogs. All those bromeliads teetering on the brink of their fragile hilltop ecosystems in the South American rain forest? All gone. It will be our fault, collectively. And it will be sad. But solving global warming and saving your favourite obscure cute species are not the same problem.


Not in 2010

2010-02-01

Things I predict we won’t be seeing in 2010:

  • Perl 6
  • Latex 3
  • 1:25,000 scale Ordnance Survey vector data made public
  • the hottest November
  • iTunes for Linux
  • reduce in Python 3.x
  • A Green Party MP (you know, the UK parliament, not some cheap plastic Euro knock-off)

  • What is IsItHotOrNot?

    2009-12-18

    This is the landing page for the Twitter account IsItHotOrNot. I’ll probably move it to a proper home at some point.

    IsItHotOrNot displays the global temperature anomaly, month by month. Currently the data is sourced from GISTEMP, specifically the “Global-mean monthly land-ocean temperature index”. The file, GLB.Ts+dSST.txt, also has annual and seasonal values, but I don’t currently use those.

    What does the number mean in “2009 Nov +0.680, hottest November on record”? It’s November’s temperature anomaly (in degrees C). The anomaly for November is the difference between this November and the average November temperature between 1951 and 1980, the base or reference period. This year’s November was 0.68 degrees C warmer than the average between 1951 and 1980. Obviously the anomalies for other months are calculated in a similar way. Note that this method takes account of the fact that Novembers are, on the whole, colder than Junes, by using the average November temperature between 1951 and 1980, not the average temperature. In other words it does seasonal correction.

    It would be nice to say “about 1.0 degrees C” warmer than pre-industrial levels, like the EEA do, and I can do that by simply adding 0.3 to all the values, but since the errors for the 1850 to 1899 (which is what is usually used when you see the phrase “pre-industrial”) average are larger (on account of having fewer records), I decided not to do that. You can wing the error and add 0.3 to all the values if you like.

    There are other temperature indexes available: NCDC do one; there’s the HadCRUT3 diagnostics; the Japanese Metereological Agency have their global average surface temperature anomalies. The different indexes use different reference periods (NCDC use 1900 to 1999, HadCRUT3 use 1961 to 1990, JMA use 1971 to 2000), the only effect of which is to move the graphs up and down the page (it does mean that if you want to compare the time series from two different providers, you need to rebase them to give them a common reference period).

    Where does the data come from? Well, from GISTEMP, but the land and ocean index is a data product synthesized from other sources:GHCN (land), USHCN (USA land), HadSST2 (ocean), Reynolds SST (ocean), SCAR READER (antarctic land). The other indexes use more or less the same data, the differences are mostly in how it is processed.

    Why GISTEMP? All the indexes are basically the same, so it doesn’t really matter which one I pick. Because of my work on Clear Climate Code I happened to be familiar with their datasets, so that was simpler from a programming perspective. The differences between the indexes are typically very small, but it does mean that the statements like: “second hottest September on record” are very sensitive to the exact dataset used. The JMA had September 2009 tying for first place, whereas GISTEMP have it second. I regard the “hottest November on record” stuff as nothing more than a popular discussion point, not an absolute truth.


    Follow

    Get every new post delivered to your Inbox.