How can we generate a list of all the lengths of all the proteins [in a specific group] in genbank? Its easy with ftp!
Seth Bordenstein wanted to know the lengths of all the proteins in all the phages on genbank:
Question for phage or genome gurus: know a way to quickly determine the sizes of the largest genes in phage genomes?
— Seth Bordenstein (@Symbionticism) May 6, 2016
Seemed like a fun exercise for a few minutes of coding. You could get all the proteins from phantome (from the Downloads folder) or you could download the two viral protein files from RefSeq (from their ftp site).
Printing out the lengths of those is trivial. This command will uncompress file file, print the ids and lengths, and sort them from longest to shortest.
gunzip -c phage_proteins_1462100402.fasta.gz | perl -ne 'chomp; if (s/^>//) {if ($id) {print "$id\t", length($seq), "\n"} $id=$_; $seq=""} else {$seq.=$_} END {print "$id\t", length($seq), "\n"}' | sort -t$'\t' -k2 -nr
You probably want to redirect the output of that command … here is the output from that command
However, can we get the data from GenBank and parse it in one go? Here is some simple code to get all the phg files, but you can change it, for eg, for bacteria. We use the FTP module from python to get a list of all the files on GenBank, and store all the files we want to retrieve. We get each of those, again using ftplib, and parse them using biopython to get the lengths of the sequences.
This code is in our github repository
import os import sys import StringIO from Bio import SeqIO from ftplib import FTP import gzip r = StringIO.StringIO() phgfiles = [] def read_data(data): r.write(data) def phage_files(filename): if 'phg' in filename: phgfiles.append(filename) def read_files(): for filename in phgfiles: sys.stderr.write("Processing phage file: " + filename + "\n") ftp.retrbinary('RETR ' + filename, read_data) r.seek(0) for seq in SeqIO.parse(gzip.GzipFile(fileobj=r), 'genbank'): lt = "" for feature in seq.features: if 'protein_id' in feature.qualifiers: lt = feature.qualifiers['protein_id'][0] if 'locus_tag' in feature.qualifiers: lt = feature.qualifiers['locus_tag'][0] if 'translation' in feature.qualifiers: print("{}\t{}\t{}".format(seq.id, lt, len(feature.qualifiers['translation'][0]))) r.close() # first find all the phg files in genbank ftp = FTP('ftp.ncbi.nlm.nih.gov') ftp.login() ftp.cwd('genbank/') ftp.retrlines('NLST', phage_files) read_files()
Here is the output from that script, I also sorted this by length.
The longest protein?
A hypothetical protein from Cyanophage P-RSM6