I’d like to introduce you to my new best friend:
Let’s say you want to do some processing of the genes in Danio rerio (zebrafish). Here’s the Ensembl project’s FASTA file containing all the peptide sequences:
(After I’ve decompressed it) the file looks like this:
>ENSDARP00000100686 pep:known scaffold:Zv8:Zv8_NA5235:758:1999:1 gene:ENSDARG00000076631 transcript:ENSDART00000111668 LKSHSNTQPYLTVDSSGKYLRSQLHSQWHPINEKFRCEVCGRCFHSNTNLLVHYAVHTGE KPYKCSFCEKGFSQKGNLQAHERIHRGEKPFSCVICGRSFTQKVCLRNHERIHRGEKPFT CMTCGKGFTQKVTLQQHLAVHDKTTKRLCITCSSTFRTSSAKAVHKPIQSSNMP >ENSDARP00000094838 pep:known scaffold:Zv8:Zv8_NA8956:1000:1382:1 gene:ENSDARG00000070696 transcript:ENSDART00000104063 FLISSVDNSEYMRNGDFLPTRLQAQQDAVNIICHSKTRSNPENNVGLITMANNCEVLTTL TPDTGRILSKLHAVQPRGVISFCTGIRVAHV >ENSDARP00000102657 pep:known scaffold:Zv8:Zv8_NA8774:1123:1377:-1 gene:ENSDARG00000078974 transcript:ENSDART00000114102 MASAPPSSPHNRSRSEDEEDPVDTMISKTGCAELHYTLLDCMAEHQDWRKCQTEVLKFKE CMSAYQNTRKEQLLKQKTSATESV [...]
It’s a list of peptide sequences, with each sequence consisting of a header line beginning with ‘>’:
>ENSDARP00000100686 pep:known …
The header contains identification information (and is probably too long for my blog); following the header is a number of lines of peptide data (using IUPAC/IUBMB amino acid abbreviations):
The job of
itertools.groupby(iterable, key=fn) is to group together the items in iterable that have the same key. For example if you have a sequence of e-mail addresses, and you use a key function that extracts the domain part of the e-mail address (
def domainkey(m): return m.partition('@')[-1]) then each group that you get out of
itertools.groupby will have e-mails that share the domain part after the ‘@’.
It is usual for the data sequence passed to
itertools.groupby to be sorted, but it doesn’t have to be, and
itertools.groupby won’t sort it. This is useful. For example, returning to our FASTA file, if we have a key function that simply identifies a line as a header or not:
def isheader(line): return line == '>'
Then when we use this in a call to
itertools.groupby the groups will simply alternate between a group of header lines, and a group of non-header lines. Like this (The printout is truncated on the right so that I had some space to write in):
(Note that all of the “header” groups are just one line long, since a sequence has only one header line)
Since the groups just alternate header, non-header, the number of groups is twice the number of sequences:
import itertools with open('Danio_rerio.Zv8.56.pep.all.fa') as f: print len(list(itertools.groupby(f, key=isheader)))//2
The above code tells me the file has 28,630 peptide sequences. Of course, using
list is a bit naughty;
len doesn’t work on iterables so I had to convert it to a list, not good (on memory usage). Could’ve used
sum, a generator expression, and a bit of Iverson’s Convention:
sum(g for g,_ in itertools.groupby(open('Danio_rerio.Zv8.56.pep.all.fa'), key=isheader)).
The useful thing about
itertools.groupby is that it lets you deal with the header and the peptide sequence as a single group, without having to worry about detecting the start of the next sequence or handling end-of-file correctly.
Say you wanted to make a dictionary that mapped from Ensembl identifiers to the peptide sequence itself:
def aspairs(file): for header,group in itertools.groupby(f, isheader): if header: line = group.next() ensembl_id = line[1:].split() else: sequence = ''.join(line.strip() for line in group) yield ensembl_id, sequence with open('Danio_rerio.Zv8.56.pep.all.fa') as f: d = dict(aspairs(f))
Notice how ensembl_id is set when a header line is read, but only used on the next spin through the loop when the sequence group is processed (in the
Filtering is easy too. Well, for one thing you can write a filter/comprehension on the output of
aspairs, but say you didn’t want to do that, and you wanted to filter in the loop itself. You need another variable, set when the header is processed, that says whether we want to process (and yield) the sequence. Say we’re interested in only the novel genes (the second word of the header is “pep:novel”):
def aspairsnovel(file): for header,group in itertools.groupby(f, isheader): if header: line =group.next() word = line[1:].split() ensembl_id = word isnovel = word == 'pep:novel' elif isnovel: sequence = ''.join(line.strip() for line in group) yield ensembl_id, sequence
Why do I mention using
itertools.groupby at all? Because if you try processing a FASTA file using a
for loop it gets a bit awkward:
def aspairs(file): seq = '' for line in file: if isheader(line): if seq: yield ensembl_id, seq ensembl_id = line[1:].split() seq = '' else: seq += line.strip() yield ensembl_id, seq
The code to emit the (id,sequence) pair occurs when a new header line is detected (inside the
if isheader, above), and it has to have a special case for the first header line in the file (because there is no sequence to emit). That code to emit the (id,sequence) pair also has to appear after the
for loop to cope with the last sequence in the file. The variable used to accumulate the sequence data, seq, has to be reset in two different places: once before the loop begins, and then inside the loop every time we see a header line.
while loop is even worse because now you have to detect end-of-file as well.
The idea of using
itertools.groupby to parse files like this came out of a discussion between me and Nick Barnes while we were coding on Clear Climate Code.
Step 2 of the GISTEMP algorithm begins by reading the
Ts.txt which looks like this:
495-11424037187400101286BEAVER MINES,AL 1935 -80 4 -50 -16 69 113 153 136 107 38 9999 -11 1936 -89 -240 -37 28 116 131 172 145 107 73 16 -73 1937 -183 -114 -24 45 92 120 153 138 103 76 -21 -53 1938 -28 -110 -4 30 84 133 158 132 147 76 -21 -38 [...more records here...] 495-11404037187400201189PINCHER CREEK A,AL 1979 9999 9999 9999 9999 9999 145 168 164 139 80 -11 -13 1980 -113 -49 -27 76 110 129 166 131 113 82 14 -73 1981 -3 -24 24 57 92 115 155 177 126 55 32 -54 [...more records here...]
You can see the structural similarity to FASTA format. In this case we have sequences of temperature records with each block starting with a header line that has information about the station.
Nick had the idea of using
itertools.groupby and I had the idea of using
key=lambda line: line==' ' (which is essentially the version of
isheader that you need for
Ts.txt files). If you like you can compare
read_text from step2.py with its counterpart,
text_to_binary, from the earlier version. Although you should note that this is not strictly a fair comparison because the earlier version is very close the original Fortran, and there are more changes than just having the idea of using
The application to FASTA files was inspired by my reading Mitchell Model’s «Bioinformatics Programming Using Python».
Bonus exercise (for the diligent reader): How bad is it to convert the output of
itertools.groupby to a list, like I do in my counting example:
len(list(itertools.groupby(f, key=isheader)))//2 ?