Archive for February, 2010

Python: a string indexing trick

2010-02-23

Let’s say you want to extract all the lines of a file where the character at index 5 is not a space:

with open('sequences.gb') as f:
  for line in f:
    if line[5] != ' ':
      print line[:-1]

Problem: short lines. line[5] might not be a valid index. Exception.

You might consider line[5:6]. Normally this is the same as line[5], but when line is short then instead of raising an Exception it evaluates to the empty string, "".

In this case it will result in short lines being selected and printed. If the test was if line[5:6] in string.letters then it will result in short lines being rejected. But that changes the semantics of the test for the lines that are long enough. If I had written if ' '.startswith(line[5:6]) then I keep the same test semantics for long lines and I reject short lines. It’s weird and not clear though.

Sometimes this trick may be appropriate, and you may be able to tweak your test to accommodate it. Don’t let the clarity slide. There is a great deal of merit to: if len(line) > 5 and line[5] != ' '.

Python: Getting FASTA with itertools.groupby

2010-02-22

I’d like to introduce you to my new best friend: itertools.groupby.

Let’s say you want to do some processing of the genes in Danio rerio (zebrafish). Here’s the Ensembl project’s FASTA file containing all the peptide sequences: ftp://ftp.ensembl.org/pub/release-56/fasta/danio_rerio/pep/Danio_rerio.Zv8.56.pep.all.fa.gz.

(After I’ve decompressed it) the file looks like this:

>ENSDARP00000100686 pep:known scaffold:Zv8:Zv8_NA5235:758:1999:1 gene:ENSDARG00000076631 transcript:ENSDART00000111668
LKSHSNTQPYLTVDSSGKYLRSQLHSQWHPINEKFRCEVCGRCFHSNTNLLVHYAVHTGE
KPYKCSFCEKGFSQKGNLQAHERIHRGEKPFSCVICGRSFTQKVCLRNHERIHRGEKPFT
CMTCGKGFTQKVTLQQHLAVHDKTTKRLCITCSSTFRTSSAKAVHKPIQSSNMP
>ENSDARP00000094838 pep:known scaffold:Zv8:Zv8_NA8956:1000:1382:1 gene:ENSDARG00000070696 transcript:ENSDART00000104063
FLISSVDNSEYMRNGDFLPTRLQAQQDAVNIICHSKTRSNPENNVGLITMANNCEVLTTL
TPDTGRILSKLHAVQPRGVISFCTGIRVAHV
>ENSDARP00000102657 pep:known scaffold:Zv8:Zv8_NA8774:1123:1377:-1 gene:ENSDARG00000078974 transcript:ENSDART00000114102
MASAPPSSPHNRSRSEDEEDPVDTMISKTGCAELHYTLLDCMAEHQDWRKCQTEVLKFKE
CMSAYQNTRKEQLLKQKTSATESV
[...]

It’s a list of peptide sequences, with each sequence consisting of a header line beginning with ‘>':

>ENSDARP00000100686 pep:known

The header contains identification information (and is probably too long for my blog); following the header is a number of lines of peptide data (using IUPAC/IUBMB amino acid abbreviations):

LKSHSNTQPYLTVDSSGKYLRSQLHSQWHP

The job of itertools.groupby(iterable, key=fn) is to group together the items in iterable that have the same key. For example if you have a sequence of e-mail addresses, and you use a key function that extracts the domain part of the e-mail address (def domainkey(m): return m.partition('@')[-1]) then each group that you get out of itertools.groupby will have e-mails that share the domain part after the ‘@’.

It is usual for the data sequence passed to itertools.groupby to be sorted, but it doesn’t have to be, and itertools.groupby won’t sort it. This is useful. For example, returning to our FASTA file, if we have a key function that simply identifies a line as a header or not:

def isheader(line):
    return line[0] == '>'

Then when we use this in a call to itertools.groupby the groups will simply alternate between a group of header lines, and a group of non-header lines. Like this (The printout is truncated on the right so that I had some space to write in):

(Note that all of the “header” groups are just one line long, since a sequence has only one header line)

Since the groups just alternate header, non-header, the number of groups is twice the number of sequences:

import itertools
with open('Danio_rerio.Zv8.56.pep.all.fa') as f:
  print len(list(itertools.groupby(f, key=isheader)))//2

The above code tells me the file has 28,630 peptide sequences. Of course, using len and list is a bit naughty; len doesn’t work on iterables so I had to convert it to a list, not good (on memory usage). Could’ve used sum, a generator expression, and a bit of Iverson’s Convention: sum(g for g,_ in itertools.groupby(open('Danio_rerio.Zv8.56.pep.all.fa'), key=isheader)).

The useful thing about itertools.groupby is that it lets you deal with the header and the peptide sequence as a single group, without having to worry about detecting the start of the next sequence or handling end-of-file correctly.

Say you wanted to make a dictionary that mapped from Ensembl identifiers to the peptide sequence itself:

def aspairs(file):
  for header,group in itertools.groupby(f, isheader):
    if header:
      line = group.next()
      ensembl_id = line[1:].split()[0]
    else:
      sequence = ''.join(line.strip() for line in group)
      yield ensembl_id, sequence

with open('Danio_rerio.Zv8.56.pep.all.fa') as f:
  d = dict(aspairs(f))

Notice how ensembl_id is set when a header line is read, but only used on the next spin through the loop when the sequence group is processed (in the else).

Filtering is easy too. Well, for one thing you can write a filter/comprehension on the output of aspairs, but say you didn’t want to do that, and you wanted to filter in the loop itself. You need another variable, set when the header is processed, that says whether we want to process (and yield) the sequence. Say we’re interested in only the novel genes (the second word of the header is “pep:novel”):

def aspairsnovel(file):
  for header,group in itertools.groupby(f, isheader):
    if header:
      line =group.next()
      word = line[1:].split()
      ensembl_id = word[0]
      isnovel = word[1] == 'pep:novel'
    elif isnovel:
      sequence = ''.join(line.strip() for line in group)
      yield ensembl_id, sequence

Why do I mention using itertools.groupby at all? Because if you try processing a FASTA file using a for loop it gets a bit awkward:

def aspairs(file):
  seq = ''
  for line in file:
    if isheader(line):
      if seq:
        yield ensembl_id, seq
      ensembl_id = line[1:].split()[0]
      seq = ''
    else:
      seq += line.strip()
  yield ensembl_id, seq

The code to emit the (id,sequence) pair occurs when a new header line is detected (inside the if isheader, above), and it has to have a special case for the first header line in the file (because there is no sequence to emit). That code to emit the (id,sequence) pair also has to appear after the for loop to cope with the last sequence in the file. The variable used to accumulate the sequence data, seq, has to be reset in two different places: once before the loop begins, and then inside the loop every time we see a header line.

Awkward.

Using a while loop is even worse because now you have to detect end-of-file as well.

The idea of using itertools.groupby to parse files like this came out of a discussion between me and Nick Barnes while we were coding on Clear Climate Code.

Step 2 of the GISTEMP algorithm begins by reading the Ts.txt which looks like this:

  495-11424037187400101286BEAVER MINES,AL                     
1935  -80    4  -50  -16   69  113  153  136  107   38 9999  -11
1936  -89 -240  -37   28  116  131  172  145  107   73   16  -73
1937 -183 -114  -24   45   92  120  153  138  103   76  -21  -53
1938  -28 -110   -4   30   84  133  158  132  147   76  -21  -38
[...more records here...]
  495-11404037187400201189PINCHER CREEK A,AL                  
1979 9999 9999 9999 9999 9999  145  168  164  139   80  -11  -13
1980 -113  -49  -27   76  110  129  166  131  113   82   14  -73
1981   -3  -24   24   57   92  115  155  177  126   55   32  -54
[...more records here...]

You can see the structural similarity to FASTA format. In this case we have sequences of temperature records with each block starting with a header line that has information about the station.

Nick had the idea of using itertools.groupby and I had the idea of using key=lambda line: line[0]==' ' (which is essentially the version of isheader that you need for Ts.txt files). If you like you can compare read_text from step2.py with its counterpart, text_to_binary, from the earlier version. Although you should note that this is not strictly a fair comparison because the earlier version is very close the original Fortran, and there are more changes than just having the idea of using itertools.groupby.

The application to FASTA files was inspired by my reading Mitchell Model’s «Bioinformatics Programming Using Python».

Bonus exercise (for the diligent reader): How bad is it to convert the output of itertools.groupby to a list, like I do in my counting example: len(list(itertools.groupby(f, key=isheader)))//2 ?

Screw the planet!

2010-02-11

The earth is getting warmer. Ecosystems are changing. Food webs around the world are upset. We are in the middle of mass extinction.

For years the green lobby have been complaining about this, finally over the last decade their voice can be heard as global warming becomes a political talking point. We can save the Earth.

Screw that! The greens are getting way too much political capital out of global warming. Sending your money to Friends of the Earth, World Wildlife Fund for Nature, the RSPB, and so on, will not solve global warming. They are not in the game.

The planet is not under threat. Our life on this planet is under threat. Runaway global warming will be bad. For any metazoan. Everything larger than than a hyrax will be wiped out. But life will go on, the Earth will be claimed once more by the Archaea and it will be business as usual really. Nature’s brief experimental dalliance with multi-cellular life will have ended. She will conclude that “further research is necessary”.

So the debate around global warming is not about saving the Earth, it is about saving our way of life on the Earth. And let’s get this straight, once we’ve “solved” global warming, we’ll have put a nuclear power plant on every (100 Km or so of) coastline, we’ll have replaced inefficient sheep pastures with huge plantations of biofuel crops, we’ll have peppered the entire countryside with wind turbines, and glazed vast deserts with solar PV farms.

We will not have saved the polar bear, the poison arrow tree frogs. All those bromeliads teetering on the brink of their fragile hilltop ecosystems in the South American rain forest? All gone. It will be our fault, collectively. And it will be sad. But solving global warming and saving your favourite obscure cute species are not the same problem.

Not in 2010

2010-02-01

Things I predict we won’t be seeing in 2010:

  • Perl 6
  • Latex 3
  • 1:25,000 scale Ordnance Survey vector data made public
  • the hottest November
  • iTunes for Linux
  • reduce in Python 3.x
  • A Green Party MP (you know, the UK parliament, not some cheap plastic Euro knock-off)
  • Follow

    Get every new post delivered to your Inbox.