How to find the lengths of all the proteins in genbank

How can we generate a list of all the lengths of all the proteins [in a specific group] in genbank? Its easy with ftp!

Seth Bordenstein wanted to know the lengths of all the proteins in all the phages on genbank:

Seemed like a fun exercise for a few minutes of coding. You could get all the proteins from phantome (from the Downloads folder) or you could download the two viral protein files from RefSeq (from their ftp site).

Printing out the lengths of those is trivial. This command will uncompress file file, print the ids and lengths, and sort them from longest to shortest.

gunzip -c phage_proteins_1462100402.fasta.gz | perl -ne 'chomp; if (s/^>//) {if ($id) {print "$id\t", length($seq), "\n"} $id=$_; $seq=""} else {$seq.=$_} END {print "$id\t", length($seq), "\n"}' | sort -t$'\t' -k2 -nr

You probably want to redirect the output of that command … here is the output from that command

However, can we get the data from GenBank and parse it in one go? Here is some simple code to get all the phg files, but you can change it, for eg, for bacteria. We use the FTP module from python to get a list of all the files on GenBank, and store all the files we want to retrieve. We get each of those, again using ftplib, and parse them using biopython to get the lengths of the sequences.

This code is in our github repository

import os
import sys

import StringIO
from Bio import SeqIO
from ftplib import FTP
import gzip


r = StringIO.StringIO()

phgfiles = []

def read_data(data):
 r.write(data)

def phage_files(filename):
 if 'phg' in filename:
 phgfiles.append(filename)


def read_files():
 for filename in phgfiles:
 sys.stderr.write("Processing phage file: " + filename + "\n")
 ftp.retrbinary('RETR ' + filename, read_data)
 r.seek(0)

 for seq in SeqIO.parse(gzip.GzipFile(fileobj=r), 'genbank'):
 lt = ""
 for feature in seq.features:
 if 'protein_id' in feature.qualifiers:
 lt = feature.qualifiers['protein_id'][0]
 if 'locus_tag' in feature.qualifiers:
 lt = feature.qualifiers['locus_tag'][0]
 if 'translation' in feature.qualifiers:
 print("{}\t{}\t{}".format(seq.id, lt, len(feature.qualifiers['translation'][0])))

 r.close()


# first find all the phg files in genbank

ftp = FTP('ftp.ncbi.nlm.nih.gov')
ftp.login()
ftp.cwd('genbank/')
ftp.retrlines('NLST', phage_files)
read_files()

Here is the output from that script, I also sorted this by length.

The longest protein?

A hypothetical protein from Cyanophage P-RSM6