I hate submitting to GenBank. It is a royal pain, and I often use workarounds to get my data submitted, e.g. using the awesome Short Read Archive. For this project, I have two complete genomes and I want them to be in GenBank, so here are my notes and thoughts about the process.
Second attempt. I put this at the top since you probably want to know what works first. Here is a web page describing some of the steps. I’m trying it out and will provide some scripts.
First, get everything together:
- A GenBank file downloaded from RAST
- A bioproject ID that you can get from here
- A biosample ID that you can get from here
- Complete this form and download the template file. Does anyone (other than NCBI) use a FAX?
- Your Locus Tag prefix (probably from your bioproject)
- Your protein id prefix (something from your lab)
- Download the rast2sqn.sh file from here
The stuff below here didn’t work. I had an error report 1,015 lines long. I don’t understand why NCBI can identify these errors but not propose solutions to them. It is just lazily offloading the work onto everyone else.
It would be so much less work for science if NCBI staff would work with the developers at RAST to fix the errors and come up with a one-button solution to this problem. But no, everyone has to suffer because of petty politics.
First, I am using the BankIt Online Submission form, mostly because I refuse to use a piece of software that is so old you need to tell it that you have Internet access. You need to log in with your NCBI (PubMed) account.
Before you start you need a couple of things:
- organism name as it appears in the NCBI taxonomy
- FASTA file of the DNA sequence. Here is a description of what they like at GenBank. In particular, note that you need to include the organism name as it appears in the NCBI Taxonomy, so make sure you get that!
- five column feature table (see below).
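For the FASTA file, NCBI's convention is to put source modifiers such as the organism in square brackets on the definition line. Here is a minimal sketch of building such a record. The function names and the `contig1` ID are made up for illustration, and you should check the GenBank description for which modifiers your submission actually needs.

```python
def fasta_defline(seq_id, organism, **modifiers):
    """Build a FASTA definition line with bracketed source modifiers.

    `organism` should match the name in the NCBI Taxonomy exactly.
    Any extra keyword arguments (e.g. strain="K-12") are appended
    as additional [key=value] modifiers.
    """
    parts = [f">{seq_id}", f"[organism={organism}]"]
    for key, value in modifiers.items():
        parts.append(f"[{key}={value}]")
    return " ".join(parts)


def fasta_record(seq_id, organism, sequence, width=60, **modifiers):
    """Return a complete FASTA record, wrapping the sequence at `width`."""
    lines = [fasta_defline(seq_id, organism, **modifiers)]
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical example values, not from a real submission.
    print(fasta_record("contig1", "Escherichia coli", "ACGT" * 40, strain="K-12"), end="")
```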
Most of the form is self-explanatory, and they will ask you some questions. Answer them. However, the five column feature table is a pain in the butt. There is a description of what they want available online.
I have written this code that uses Biopython to convert a GenBank (gbk) file to this five column format. There are a couple of random and unpredictable errors, and you’ll have to iterate this process many times, editing the GenBank file or TSV file and repeating the upload until you get the right response.
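To give an idea of what the five column format looks like, here is a rough sketch of that kind of conversion (this is not the script linked above). Each feature gets a line with start, stop, and feature key, followed by qualifier lines indented with three tabs. Features on the minus strand are written with start greater than stop. The function names and the `wanted` qualifier list are my own assumptions. The sketch skips `source` features and ignores spliced (joined) locations, so treat it as a starting point only.

```python
def format_feature_table(seq_id, features):
    """Format features as a five column, tab-delimited feature table.

    `features` is a list of (start, end, strand, key, qualifiers) tuples
    with 1-based inclusive coordinates; `qualifiers` is a list of
    (name, value) pairs.
    """
    lines = [f">Feature {seq_id}"]
    for start, end, strand, key, qualifiers in features:
        if strand == -1:
            # Minus-strand features are written with start > stop.
            start, end = end, start
        lines.append(f"{start}\t{end}\t{key}")
        for name, value in qualifiers:
            # Columns 1-3 are left empty on qualifier lines.
            lines.append(f"\t\t\t{name}\t{value}")
    return "\n".join(lines) + "\n"


def features_from_genbank(path, wanted=("gene", "locus_tag", "product", "protein_id")):
    """Yield (record_id, features) for each record in a GenBank file.

    Requires Biopython. `wanted` is an assumed list of qualifiers to keep;
    adjust it to whatever the submission needs.
    """
    from Bio import SeqIO  # imported here so the formatter works without Biopython

    for record in SeqIO.parse(path, "genbank"):
        features = []
        for feat in record.features:
            if feat.type == "source":
                continue  # assumption: source information is entered in the form
            start = int(feat.location.start) + 1  # Biopython starts are 0-based
            end = int(feat.location.end)
            quals = [(name, value)
                     for name in wanted
                     for value in feat.qualifiers.get(name, [])]
            features.append((start, end, feat.location.strand, feat.type, quals))
        yield record.id, features
```

To write the table for a file, loop over `features_from_genbank("genome.gbk")` (a placeholder filename) and write `format_feature_table(record_id, features)` for each record.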