Python: Getting FASTA with itertools.groupby

2010-02-22

I’d like to introduce you to my new best friend: itertools.groupby.

Let’s say you want to do some processing of the genes in Danio rerio (zebrafish). Here’s the Ensembl project’s FASTA file containing all the peptide sequences: ftp://ftp.ensembl.org/pub/release-56/fasta/danio_rerio/pep/Danio_rerio.Zv8.56.pep.all.fa.gz.

(After I’ve decompressed it) the file looks like this:

>ENSDARP00000100686 pep:known scaffold:Zv8:Zv8_NA5235:758:1999:1 gene:ENSDARG00000076631 transcript:ENSDART00000111668
LKSHSNTQPYLTVDSSGKYLRSQLHSQWHPINEKFRCEVCGRCFHSNTNLLVHYAVHTGE
KPYKCSFCEKGFSQKGNLQAHERIHRGEKPFSCVICGRSFTQKVCLRNHERIHRGEKPFT
CMTCGKGFTQKVTLQQHLAVHDKTTKRLCITCSSTFRTSSAKAVHKPIQSSNMP
>ENSDARP00000094838 pep:known scaffold:Zv8:Zv8_NA8956:1000:1382:1 gene:ENSDARG00000070696 transcript:ENSDART00000104063
FLISSVDNSEYMRNGDFLPTRLQAQQDAVNIICHSKTRSNPENNVGLITMANNCEVLTTL
TPDTGRILSKLHAVQPRGVISFCTGIRVAHV
>ENSDARP00000102657 pep:known scaffold:Zv8:Zv8_NA8774:1123:1377:-1 gene:ENSDARG00000078974 transcript:ENSDART00000114102
MASAPPSSPHNRSRSEDEEDPVDTMISKTGCAELHYTLLDCMAEHQDWRKCQTEVLKFKE
CMSAYQNTRKEQLLKQKTSATESV
[...]

It’s a list of peptide sequences, with each sequence consisting of a header line beginning with ‘>’:

>ENSDARP00000100686 pep:known

The header contains identification information (and is probably too long for my blog); following the header is a number of lines of peptide data (using IUPAC/IUBMB amino acid abbreviations):

LKSHSNTQPYLTVDSSGKYLRSQLHSQWHP

The job of itertools.groupby(iterable, key=fn) is to group together the items in iterable that have the same key. For example if you have a sequence of e-mail addresses, and you use a key function that extracts the domain part of the e-mail address (def domainkey(m): return m.partition('@')[-1]) then each group that you get out of itertools.groupby will have e-mails that share the domain part after the ‘@’.

It is usual for the data sequence passed to itertools.groupby to be sorted, but it doesn’t have to be, and itertools.groupby won’t sort it. This is useful. For example, returning to our FASTA file, if we have a key function that simply identifies a line as a header or not:

def isheader(line):
    return line[0] == '>'

Then when we use this in a call to itertools.groupby the groups will simply alternate between a group of header lines, and a group of non-header lines. Like this (The printout is truncated on the right so that I had some space to write in):

(Note that all of the “header” groups are just one line long, since a sequence has only one header line)

Since the groups just alternate header, non-header, the number of groups is twice the number of sequences:

import itertools
with open('Danio_rerio.Zv8.56.pep.all.fa') as f:
  print len(list(itertools.groupby(f, key=isheader)))//2

The above code tells me the file has 28,630 peptide sequences. Of course, using len and list is a bit naughty; len doesn’t work on iterables so I had to convert it to a list, not good (on memory usage). Could’ve used sum, a generator expression, and a bit of Iverson’s Convention: sum(g for g,_ in itertools.groupby(open('Danio_rerio.Zv8.56.pep.all.fa'), key=isheader)).

The useful thing about itertools.groupby is that it lets you deal with the header and the peptide sequence as a single group, without having to worry about detecting the start of the next sequence or handling end-of-file correctly.

Say you wanted to make a dictionary that mapped from Ensembl identifiers to the peptide sequence itself:

def aspairs(file):
  for header,group in itertools.groupby(f, isheader):
    if header:
      line = group.next()
      ensembl_id = line[1:].split()[0]
    else:
      sequence = ''.join(line.strip() for line in group)
      yield ensembl_id, sequence

with open('Danio_rerio.Zv8.56.pep.all.fa') as f:
  d = dict(aspairs(f))

Notice how ensembl_id is set when a header line is read, but only used on the next spin through the loop when the sequence group is processed (in the else).

Filtering is easy too. Well, for one thing you can write a filter/comprehension on the output of aspairs, but say you didn’t want to do that, and you wanted to filter in the loop itself. You need another variable, set when the header is processed, that says whether we want to process (and yield) the sequence. Say we’re interested in only the novel genes (the second word of the header is “pep:novel”):

def aspairsnovel(file):
  for header,group in itertools.groupby(f, isheader):
    if header:
      line =group.next()
      word = line[1:].split()
      ensembl_id = word[0]
      isnovel = word[1] == 'pep:novel'
    elif isnovel:
      sequence = ''.join(line.strip() for line in group)
      yield ensembl_id, sequence

Why do I mention using itertools.groupby at all? Because if you try processing a FASTA file using a for loop it gets a bit awkward:

def aspairs(file):
  seq = ''
  for line in file:
    if isheader(line):
      if seq:
        yield ensembl_id, seq
      ensembl_id = line[1:].split()[0]
      seq = ''
    else:
      seq += line.strip()
  yield ensembl_id, seq

The code to emit the (id,sequence) pair occurs when a new header line is detected (inside the if isheader, above), and it has to have a special case for the first header line in the file (because there is no sequence to emit). That code to emit the (id,sequence) pair also has to appear after the for loop to cope with the last sequence in the file. The variable used to accumulate the sequence data, seq, has to be reset in two different places: once before the loop begins, and then inside the loop every time we see a header line.

Awkward.

Using a while loop is even worse because now you have to detect end-of-file as well.

The idea of using itertools.groupby to parse files like this came out of a discussion between me and Nick Barnes while we were coding on Clear Climate Code.

Step 2 of the GISTEMP algorithm begins by reading the Ts.txt which looks like this:

  495-11424037187400101286BEAVER MINES,AL                     
1935  -80    4  -50  -16   69  113  153  136  107   38 9999  -11
1936  -89 -240  -37   28  116  131  172  145  107   73   16  -73
1937 -183 -114  -24   45   92  120  153  138  103   76  -21  -53
1938  -28 -110   -4   30   84  133  158  132  147   76  -21  -38
[...more records here...]
  495-11404037187400201189PINCHER CREEK A,AL                  
1979 9999 9999 9999 9999 9999  145  168  164  139   80  -11  -13
1980 -113  -49  -27   76  110  129  166  131  113   82   14  -73
1981   -3  -24   24   57   92  115  155  177  126   55   32  -54
[...more records here...]

You can see the structural similarity to FASTA format. In this case we have sequences of temperature records with each block starting with a header line that has information about the station.

Nick had the idea of using itertools.groupby and I had the idea of using key=lambda line: line[0]==' ' (which is essentially the version of isheader that you need for Ts.txt files). If you like you can compare read_text from step2.py with its counterpart, text_to_binary, from the earlier version. Although you should note that this is not strictly a fair comparison because the earlier version is very close the original Fortran, and there are more changes than just having the idea of using itertools.groupby.

The application to FASTA files was inspired by my reading Mitchell Model’s «Bioinformatics Programming Using Python».

Bonus exercise (for the diligent reader): How bad is it to convert the output of itertools.groupby to a list, like I do in my counting example: len(list(itertools.groupby(f, key=isheader)))//2 ?

8 Responses to “Python: Getting FASTA with itertools.groupby”

  1. Stephan Says:

    The for-loop implementation of the aspairs function has another (minor) drawback: it does not gracefully handle empty files. The groupby implementation does.

    • drj11 Says:

      Indeed, I had forgotten that (I now remember making exactly the same comment to Nick when I was discussing the corresponding code in Clear Climate Code).

      Probably not a practical concern for this example, but definitely important for making it robust for the general case.

  2. drj11 Says:

    The long lines look so much better in Google Reader. RSS For The Win!

  3. drj11 Says:

    By the way, I half expected someone to come along and say “don’t be stupid, no-one in their right mind parses FASTA files by hand, everyone uses this awesome tool _over here_”.

    Anyone?

  4. bc Says:

    Excuse the plug, but this is exactly the sort application for which I made sendtools (http://pypi.python.org/pypi/sendtools).

    itertools.group_by only gets you so far. itertools is great as long as you only need to collect a single result (processing just the “novel” genes, for example)… But what if you want to process both the novel genes and the “standard” ones, handling them slightly differently? Using itertools, you need to make two passes over your source iterator (or revert to a procedural style). With sendtools, you can set up the processing for both and execute it in a single ‘send’ operation.

  5. Nicholi P. Says:

    HELP?
    i need to map 20bp sequences produces by a gene expression analysis to the contigs produced by a 454 assemblage, here is the kicker i dont care about the tags mapped to the ORF range. i have a file with the ORF ranges for each contig in the assemblage i just can’t combine all 3 ideas, im a noob so sorry if this is super cakie

  6. Santhosh Hegde Says:

    I have a genome as fasta file.and I have a small sequence.I want to find whether that sequence is present in that genome or not.As well as I want to know the exact position of the genome if the genome has that sequence.How to find this using python?


Leave a comment