Posts Tagged ‘bioinformatics’

Python: Getting FASTA with itertools.groupby


I’d like to introduce you to my new best friend: itertools.groupby.

Let’s say you want to do some processing of the genes in Danio rerio (zebrafish). Here’s the Ensembl project’s FASTA file containing all the peptide sequences:

(After I’ve decompressed it) the file looks like this:

>ENSDARP00000100686 pep:known scaffold:Zv8:Zv8_NA5235:758:1999:1 gene:ENSDARG00000076631 transcript:ENSDART00000111668
>ENSDARP00000094838 pep:known scaffold:Zv8:Zv8_NA8956:1000:1382:1 gene:ENSDARG00000070696 transcript:ENSDART00000104063
>ENSDARP00000102657 pep:known scaffold:Zv8:Zv8_NA8774:1123:1377:-1 gene:ENSDARG00000078974 transcript:ENSDART00000114102

It’s a list of peptide sequences, with each sequence consisting of a header line beginning with ‘>’:

>ENSDARP00000100686 pep:known

The header contains identification information (and is probably too long for my blog); following the header is a number of lines of peptide data (using IUPAC/IUBMB amino acid abbreviations):


The job of itertools.groupby(iterable, key=fn) is to group together the items in iterable that have the same key. For example if you have a sequence of e-mail addresses, and you use a key function that extracts the domain part of the e-mail address (def domainkey(m): return m.partition('@')[-1]) then each group that you get out of itertools.groupby will have e-mails that share the domain part after the ‘@’.

It is usual for the data sequence passed to itertools.groupby to be sorted, but it doesn’t have to be, and itertools.groupby won’t sort it. This is useful. For example, returning to our FASTA file, if we have a key function that simply identifies a line as a header or not:

def isheader(line):
    return line[0] == '>'

Then when we use this in a call to itertools.groupby the groups will simply alternate between a group of header lines, and a group of non-header lines. Like this (The printout is truncated on the right so that I had some space to write in):

(Note that all of the “header” groups are just one line long, since a sequence has only one header line)

Since the groups just alternate header, non-header, the number of groups is twice the number of sequences:

import itertools
with open('Danio_rerio.Zv8.56.pep.all.fa') as f:
  print len(list(itertools.groupby(f, key=isheader)))//2

The above code tells me the file has 28,630 peptide sequences. Of course, using len and list is a bit naughty; len doesn’t work on iterables so I had to convert it to a list, not good (on memory usage). Could’ve used sum, a generator expression, and a bit of Iverson’s Convention: sum(g for g,_ in itertools.groupby(open('Danio_rerio.Zv8.56.pep.all.fa'), key=isheader)).

The useful thing about itertools.groupby is that it lets you deal with the header and the peptide sequence as a single group, without having to worry about detecting the start of the next sequence or handling end-of-file correctly.

Say you wanted to make a dictionary that mapped from Ensembl identifiers to the peptide sequence itself:

def aspairs(file):
  for header,group in itertools.groupby(f, isheader):
    if header:
      line =
      ensembl_id = line[1:].split()[0]
      sequence = ''.join(line.strip() for line in group)
      yield ensembl_id, sequence

with open('Danio_rerio.Zv8.56.pep.all.fa') as f:
  d = dict(aspairs(f))

Notice how ensembl_id is set when a header line is read, but only used on the next spin through the loop when the sequence group is processed (in the else).

Filtering is easy too. Well, for one thing you can write a filter/comprehension on the output of aspairs, but say you didn’t want to do that, and you wanted to filter in the loop itself. You need another variable, set when the header is processed, that says whether we want to process (and yield) the sequence. Say we’re interested in only the novel genes (the second word of the header is “pep:novel”):

def aspairsnovel(file):
  for header,group in itertools.groupby(f, isheader):
    if header:
      word = line[1:].split()
      ensembl_id = word[0]
      isnovel = word[1] == 'pep:novel'
    elif isnovel:
      sequence = ''.join(line.strip() for line in group)
      yield ensembl_id, sequence

Why do I mention using itertools.groupby at all? Because if you try processing a FASTA file using a for loop it gets a bit awkward:

def aspairs(file):
  seq = ''
  for line in file:
    if isheader(line):
      if seq:
        yield ensembl_id, seq
      ensembl_id = line[1:].split()[0]
      seq = ''
      seq += line.strip()
  yield ensembl_id, seq

The code to emit the (id,sequence) pair occurs when a new header line is detected (inside the if isheader, above), and it has to have a special case for the first header line in the file (because there is no sequence to emit). That code to emit the (id,sequence) pair also has to appear after the for loop to cope with the last sequence in the file. The variable used to accumulate the sequence data, seq, has to be reset in two different places: once before the loop begins, and then inside the loop every time we see a header line.


Using a while loop is even worse because now you have to detect end-of-file as well.

The idea of using itertools.groupby to parse files like this came out of a discussion between me and Nick Barnes while we were coding on Clear Climate Code.

Step 2 of the GISTEMP algorithm begins by reading the Ts.txt which looks like this:

  495-11424037187400101286BEAVER MINES,AL                     
1935  -80    4  -50  -16   69  113  153  136  107   38 9999  -11
1936  -89 -240  -37   28  116  131  172  145  107   73   16  -73
1937 -183 -114  -24   45   92  120  153  138  103   76  -21  -53
1938  -28 -110   -4   30   84  133  158  132  147   76  -21  -38
[...more records here...]
  495-11404037187400201189PINCHER CREEK A,AL                  
1979 9999 9999 9999 9999 9999  145  168  164  139   80  -11  -13
1980 -113  -49  -27   76  110  129  166  131  113   82   14  -73
1981   -3  -24   24   57   92  115  155  177  126   55   32  -54
[...more records here...]

You can see the structural similarity to FASTA format. In this case we have sequences of temperature records with each block starting with a header line that has information about the station.

Nick had the idea of using itertools.groupby and I had the idea of using key=lambda line: line[0]==' ' (which is essentially the version of isheader that you need for Ts.txt files). If you like you can compare read_text from with its counterpart, text_to_binary, from the earlier version. Although you should note that this is not strictly a fair comparison because the earlier version is very close the original Fortran, and there are more changes than just having the idea of using itertools.groupby.

The application to FASTA files was inspired by my reading Mitchell Model’s «Bioinformatics Programming Using Python».

Bonus exercise (for the diligent reader): How bad is it to convert the output of itertools.groupby to a list, like I do in my counting example: len(list(itertools.groupby(f, key=isheader)))//2 ?